Severi Santavirta, Yuhang Wu, Lauri Suominen, Lauri Nummenmaa
{"title":"GPT-4V在现象学和神经层面显示出与人类相似的社会感知能力。","authors":"Severi Santavirta, Yuhang Wu, Lauri Suominen, Lauri Nummenmaa","doi":"10.1162/IMAG.a.134","DOIUrl":null,"url":null,"abstract":"<p><p>Humans navigate the social world by rapidly perceiving social features from other people and their interaction. Recently, large-language models (LLMs) have achieved high-level visual capabilities for detailed object and scene content recognition and description. This raises the question whether LLMs can infer complex social information from images and videos, and whether the high-dimensional structure of the feature annotations aligns with that of humans. We collected evaluations for 138 social features from GPT-4V for images (N = 468) and videos (N = 234) that are derived from social movie scenes. These evaluations were compared with human evaluations (N = 2,254). The comparisons established that GPT-4V can achieve human-like capabilities at annotating individual social features. The GPT-4V social feature annotations also express similar structural representation compared to the human social perceptual structure (i.e., similar correlation matrix over all social feature annotations). Finally, we modeled hemodynamic responses (N = 97) to viewing socioemotional movie clips with feature annotations by human observers and GPT-4V. These results demonstrated that GPT-4V based stimulus models can also reveal the social perceptual network in the human brain highly similar to the stimulus models based on human annotations. These human-like annotation capabilities of LLMs could have a wide range of real-life applications ranging from health care to business and would open exciting new avenues for psychological and neuroscientific research.</p>","PeriodicalId":73341,"journal":{"name":"Imaging neuroscience (Cambridge, Mass.)","volume":"3 ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2025-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12410153/pdf/","citationCount":"0","resultStr":"{\"title\":\"GPT-4V shows human-like social perceptual capabilities at phenomenological and neural levels.\",\"authors\":\"Severi Santavirta, Yuhang Wu, Lauri Suominen, Lauri Nummenmaa\",\"doi\":\"10.1162/IMAG.a.134\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Humans navigate the social world by rapidly perceiving social features from other people and their interaction. Recently, large-language models (LLMs) have achieved high-level visual capabilities for detailed object and scene content recognition and description. This raises the question whether LLMs can infer complex social information from images and videos, and whether the high-dimensional structure of the feature annotations aligns with that of humans. We collected evaluations for 138 social features from GPT-4V for images (N = 468) and videos (N = 234) that are derived from social movie scenes. These evaluations were compared with human evaluations (N = 2,254). The comparisons established that GPT-4V can achieve human-like capabilities at annotating individual social features. The GPT-4V social feature annotations also express similar structural representation compared to the human social perceptual structure (i.e., similar correlation matrix over all social feature annotations). Finally, we modeled hemodynamic responses (N = 97) to viewing socioemotional movie clips with feature annotations by human observers and GPT-4V. 
These results demonstrated that GPT-4V based stimulus models can also reveal the social perceptual network in the human brain highly similar to the stimulus models based on human annotations. These human-like annotation capabilities of LLMs could have a wide range of real-life applications ranging from health care to business and would open exciting new avenues for psychological and neuroscientific research.</p>\",\"PeriodicalId\":73341,\"journal\":{\"name\":\"Imaging neuroscience (Cambridge, Mass.)\",\"volume\":\"3 \",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2025-09-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12410153/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Imaging neuroscience (Cambridge, Mass.)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1162/IMAG.a.134\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/1/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Imaging neuroscience (Cambridge, Mass.)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1162/IMAG.a.134","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/1/1 0:00:00","PubModel":"eCollection","JCR":"","JCRName":"","Score":null,"Total":0}
GPT-4V shows human-like social perceptual capabilities at phenomenological and neural levels.
Humans navigate the social world by rapidly perceiving social features from other people and their interactions. Recently, large language models (LLMs) have achieved high-level visual capabilities for detailed object and scene recognition and description. This raises the question of whether LLMs can infer complex social information from images and videos, and whether the high-dimensional structure of their feature annotations aligns with that of humans. We collected evaluations of 138 social features from GPT-4V for images (N = 468) and videos (N = 234) derived from social movie scenes and compared them with human evaluations (N = 2,254). The comparisons established that GPT-4V achieves human-like performance in annotating individual social features. The GPT-4V annotations also express a structural representation similar to the human social perceptual structure (i.e., a similar correlation matrix over all social feature annotations). Finally, we modeled hemodynamic responses (N = 97) to socioemotional movie clips using feature annotations from human observers and from GPT-4V. These results demonstrated that GPT-4V-based stimulus models reveal a social perceptual network in the human brain highly similar to that revealed by stimulus models based on human annotations. These human-like annotation capabilities of LLMs could have a wide range of real-life applications ranging from health care to business and would open exciting new avenues for psychological and neuroscientific research.
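The structural comparison described above (a similar correlation matrix over all social feature annotations) can be illustrated with a minimal sketch. This is not the authors' analysis code; the array names, shapes, and the use of a Spearman correlation between the off-diagonal elements of the two feature-by-feature correlation matrices are illustrative assumptions.

```python
# Minimal sketch (not the authors' code): compare the structure of human and
# GPT-4V social feature annotations by correlating their feature-by-feature
# correlation matrices. Input arrays are hypothetical, shape (n_stimuli, n_features).
import numpy as np
from scipy.stats import spearmanr

def structural_similarity(human_ratings: np.ndarray, gpt_ratings: np.ndarray) -> float:
    """Correlate the upper triangles of the two feature-correlation matrices."""
    human_corr = np.corrcoef(human_ratings, rowvar=False)  # features x features
    gpt_corr = np.corrcoef(gpt_ratings, rowvar=False)
    iu = np.triu_indices_from(human_corr, k=1)             # off-diagonal entries only
    rho, _ = spearmanr(human_corr[iu], gpt_corr[iu])
    return rho

# Example with simulated data: 468 stimuli rated on 138 social features.
rng = np.random.default_rng(0)
human = rng.normal(size=(468, 138))
gpt = human + rng.normal(scale=0.5, size=(468, 138))       # noisy copy, for illustration
print(f"Structural similarity (Spearman rho): {structural_similarity(human, gpt):.2f}")
```

In practice, the same idea extends to comparing stimulus models for the hemodynamic analyses: one design matrix built from human annotations and one from GPT-4V annotations, fit to the same fMRI data and compared at the map level.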