Comparing perceptual judgments in large multimodal models and humans.

IF 3.9 · CAS Tier 2 (Psychology) · JCR Q1 (Psychology, Experimental)
Billy Dickson, Sahaj Singh Maini, Craig Sanders, Robert Nosofsky, Zoran Tiganj
{"title":"Comparing perceptual judgments in large multimodal models and humans.","authors":"Billy Dickson, Sahaj Singh Maini, Craig Sanders, Robert Nosofsky, Zoran Tiganj","doi":"10.3758/s13428-025-02728-w","DOIUrl":null,"url":null,"abstract":"<p><p>Cognitive scientists commonly collect participants' judgments regarding perceptual characteristics of stimuli to develop and evaluate models of attention, memory, learning, and decision-making. For instance, to model human responses in tasks of category learning and item recognition, researchers often collect perceptual judgments of images in order to embed the images in multidimensional feature spaces. This process is time-consuming and costly. Recent advancements in large multimodal models (LMMs) provide a potential alternative because such models can respond to prompts that include both text and images and could potentially replace human participants. To test whether the available LMMs can indeed be useful for this purpose, we evaluated their judgments on a dataset consisting of rock images that has been widely used by cognitive scientists. The dataset includes human perceptual judgments along 10 dimensions considered important for classifying rock images. Among the LMMs that we investigated, GPT-4o exhibited the strongest positive correlation with human responses and demonstrated promising alignment with the mean ratings from human participants, particularly for elementary dimensions such as lightness, chromaticity, shininess, and fine/coarse grain texture. However, its correlations with human ratings were lower for more abstract and rock-specific emergent dimensions such as organization and pegmatitic structure. Although there is room for further improvement, the model already appears to be approaching the level of consensus observed across human groups for the perceptual features examined here. Our study provides a benchmark for evaluating future LMMs on human perceptual judgment data.</p>","PeriodicalId":8717,"journal":{"name":"Behavior Research Methods","volume":"57 7","pages":"203"},"PeriodicalIF":3.9000,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12178973/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Behavior Research Methods","FirstCategoryId":"102","ListUrlMain":"https://doi.org/10.3758/s13428-025-02728-w","RegionNum":2,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"PSYCHOLOGY, EXPERIMENTAL","Score":null,"Total":0}
引用次数: 0

Abstract

Cognitive scientists commonly collect participants' judgments regarding perceptual characteristics of stimuli to develop and evaluate models of attention, memory, learning, and decision-making. For instance, to model human responses in tasks of category learning and item recognition, researchers often collect perceptual judgments of images in order to embed the images in multidimensional feature spaces. This process is time-consuming and costly. Recent advancements in large multimodal models (LMMs) provide a potential alternative because such models can respond to prompts that include both text and images and could potentially replace human participants. To test whether the available LMMs can indeed be useful for this purpose, we evaluated their judgments on a dataset consisting of rock images that has been widely used by cognitive scientists. The dataset includes human perceptual judgments along 10 dimensions considered important for classifying rock images. Among the LMMs that we investigated, GPT-4o exhibited the strongest positive correlation with human responses and demonstrated promising alignment with the mean ratings from human participants, particularly for elementary dimensions such as lightness, chromaticity, shininess, and fine/coarse grain texture. However, its correlations with human ratings were lower for more abstract and rock-specific emergent dimensions such as organization and pegmatitic structure. Although there is room for further improvement, the model already appears to be approaching the level of consensus observed across human groups for the perceptual features examined here. Our study provides a benchmark for evaluating future LMMs on human perceptual judgment data.
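
The abstract describes the core procedure at a high level: prompt an LMM with a rock image and a rating instruction for one perceptual dimension, collect a numeric rating per image, and correlate the model's ratings with the mean human ratings across images. As a minimal sketch of how such an evaluation could be set up (not the authors' actual pipeline), assuming the OpenAI Python client with image inputs, hypothetical image file names, and illustrative human mean ratings:

```python
import base64
from openai import OpenAI          # assumes the openai package (>=1.0) is installed
from scipy.stats import pearsonr   # correlation between model and human ratings

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def rate_image(image_path: str, dimension: str) -> float:
    """Ask the model for a 1-9 rating of one perceptual dimension of one image.
    Hypothetical helper for illustration; prompt wording is not from the paper."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    prompt = (
        f"Rate the {dimension} of the rock in this image on a scale from 1 to 9. "
        "Respond with a single number only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    )
    return float(response.choices[0].message.content.strip())

# Hypothetical inputs: image files and mean human ratings for one dimension.
image_paths = ["rock_001.jpg", "rock_002.jpg", "rock_003.jpg"]
human_means = [3.2, 6.8, 5.1]  # illustrative mean human ratings for "lightness"

model_ratings = [rate_image(p, "lightness") for p in image_paths]
r, p_value = pearsonr(model_ratings, human_means)
print(f"Pearson r between model and human mean ratings: {r:.2f} (p = {p_value:.3f})")
```

In practice one would repeat this per dimension (lightness, chromaticity, shininess, grain texture, organization, pegmatitic structure, etc.) and across the full image set, which is what allows the per-dimension comparison of correlations reported in the abstract.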

Source journal: Behavior Research Methods
CiteScore: 10.30
Self-citation rate: 9.30%
Publications: 266
Journal description: Behavior Research Methods publishes articles concerned with the methods, techniques, and instrumentation of research in experimental psychology. The journal focuses particularly on the use of computer technology in psychological research. An annual special issue is devoted to this field.