Approximating Human-Level 3D Visual Inferences With Deep Neural Networks

Q1 Social Sciences

Open Mind Pub Date : 2025-02-16 eCollection Date: 2025-01-01 DOI:10.1162/opmi_a_00189

Thomas P O'Connell, Tyler Bonnen, Yoni Friedman, Ayush Tewari, Vincent Sitzmann, Joshua B Tenenbaum, Nancy Kanwisher

Open Mind, volume 9, pages 305-324. Full-text PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11864798/pdf/

Citations: 0

Abstract

Humans make rich inferences about the geometry of the visual world. While deep neural networks (DNNs) achieve human-level performance on some psychophysical tasks (e.g., rapid classification of object or scene categories), they often fail in tasks requiring inferences about the underlying shape of objects or scenes. Here, we ask whether and how this gap in 3D shape representation between DNNs and humans can be closed. First, we define the problem space: after generating a stimulus set to evaluate 3D shape inferences using a match-to-sample task, we confirm that standard DNNs are unable to reach human performance. Next, we construct a set of candidate 3D-aware DNNs including 3D neural field (Light Field Network), autoencoder, and convolutional architectures. We investigate the role of the learning objective and dataset by training single-view (the model only sees one viewpoint of an object per training trial) and multi-view (the model is trained to associate multiple viewpoints of each object per training trial) versions of each architecture. When the same object categories appear in the model training and match-to-sample test sets, multi-view DNNs approach human-level performance for 3D shape matching, highlighting the importance of a learning objective that enforces a common representation across viewpoints of the same object. Furthermore, the 3D Light Field Network was the model most similar to humans across all tests, suggesting that building in 3D inductive biases increases human-model alignment. Finally, we explore the generalization performance of multi-view DNNs to out-of-distribution object categories not seen during training. Overall, our work shows that multi-view learning objectives for DNNs are necessary but not sufficient to make similar 3D shape inferences as humans and reveals limitations in capturing human-like shape inferences that may be inherent to DNN modeling approaches. We provide a methodology for understanding human 3D shape perception within a deep learning framework and highlight out-of-domain generalization as the next challenge for learning human-like 3D representations with DNNs.
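
To make the match-to-sample evaluation mentioned in the abstract more concrete, here is a minimal sketch of how a model's choice on such a trial might be scored. The abstract does not say how model responses were read out, so this is an assumption: the model is taken to pick the candidate image whose embedding has the highest cosine similarity to the sample's embedding. The function names (`cosine_similarity`, `match_to_sample_choice`, `accuracy`) and the toy embeddings are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple

Embedding = Sequence[float]
# A trial is (sample embedding, candidate embeddings, index of the true match).
Trial = Tuple[Embedding, List[Embedding], int]


def cosine_similarity(a: Embedding, b: Embedding) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_to_sample_choice(sample: Embedding, candidates: List[Embedding]) -> int:
    """Return the index of the candidate most similar to the sample.

    Assumed readout rule (not from the paper): highest cosine similarity wins.
    """
    similarities = [cosine_similarity(sample, c) for c in candidates]
    return max(range(len(candidates)), key=lambda i: similarities[i])


def accuracy(trials: List[Trial]) -> float:
    """Fraction of trials on which the chosen candidate is the true match."""
    correct = sum(
        match_to_sample_choice(sample, candidates) == target
        for sample, candidates, target in trials
    )
    return correct / len(trials)


# Toy, made-up embeddings for illustration only (not data from the paper).
toy_trials: List[Trial] = [
    ([1.0, 0.0, 0.2], [[0.9, 0.1, 0.3], [0.0, 1.0, 0.0]], 0),
    ([0.0, 1.0, 0.1], [[1.0, 0.0, 0.0], [0.1, 0.8, 0.2]], 1),
]

if __name__ == "__main__":
    print(f"toy accuracy: {accuracy(toy_trials):.2f}")
```

In a setup like this, the sample and the matching candidate would show the same object from different viewpoints, so a model scores well only if its embeddings stay similar across viewpoints of one object. That is the property the abstract attributes to the multi-view learning objective.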

Source journal

Open Mind Social Sciences-Linguistics and Language

CiteScore

3.20

Self-citation rate

0.00%

Articles published

Review time

53 weeks