Psychometrically derived 60-question benchmarks: Substantial efficiencies and the possibility of human-AI comparisons

IF 3.3 · Tier 2 (Psychology) · Q1 PSYCHOLOGY, MULTIDISCIPLINARY
Gilles E. Gignac, David Ilić
{"title":"心理测量学衍生的60个问题基准:实质性的效率和人类与人工智能比较的可能性","authors":"Gilles E. Gignac ,&nbsp;David Ilić","doi":"10.1016/j.intell.2025.101922","DOIUrl":null,"url":null,"abstract":"<div><div>Large Language Model (LLM) benchmark evaluation tests often comprise thousands of questions. Based on psychometric principles, reliable and valid benchmark tests can likely be developed with as few as 60 items, comparable to human intelligence tests, which typically include only 15 to 60 items. The establishment of shorter benchmark tests offers numerous potential benefits, including more efficient evaluation of LLMs, the practical feasibility of creating parallel forms, and the ability to directly compare LLM performance with human capabilities. Consequently, we analysed the performance of 591 LLMs across three widely recognized benchmarks—HellaSwag, Winogrande, and GSM8K—and developed short-forms (≈ 60 questions each) using psychometric principles. The short-forms exhibited high internal consistency reliability, with coefficient omega values ranging from 0.96 for Winogrande to 0.99 for HellaSwag and GSM8K. Additionally, strong correlations between short- and long-form scores (<em>r</em> ≈ 0.90) provided evidence of concurrent validity. Finally, model size (number of parameters) was a slightly stronger predictor of overall LLM performance for the short-forms compared to the long-forms, indicating that the short forms exhibited comparable, if not slightly superior, convergent validity. It is concluded that shorter benchmarks may accelerate AI development by enabling more efficient evaluations. Additionally, research into the nature of intelligence may be facilitated by benchmark short-forms by enabling direct comparisons between AI and human performance.</div></div>","PeriodicalId":13862,"journal":{"name":"Intelligence","volume":"110 ","pages":"Article 101922"},"PeriodicalIF":3.3000,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Psychometrically derived 60-question benchmarks: Substantial efficiencies and the possibility of human-AI comparisons\",\"authors\":\"Gilles E. Gignac ,&nbsp;David Ilić\",\"doi\":\"10.1016/j.intell.2025.101922\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Large Language Model (LLM) benchmark evaluation tests often comprise thousands of questions. Based on psychometric principles, reliable and valid benchmark tests can likely be developed with as few as 60 items, comparable to human intelligence tests, which typically include only 15 to 60 items. The establishment of shorter benchmark tests offers numerous potential benefits, including more efficient evaluation of LLMs, the practical feasibility of creating parallel forms, and the ability to directly compare LLM performance with human capabilities. Consequently, we analysed the performance of 591 LLMs across three widely recognized benchmarks—HellaSwag, Winogrande, and GSM8K—and developed short-forms (≈ 60 questions each) using psychometric principles. The short-forms exhibited high internal consistency reliability, with coefficient omega values ranging from 0.96 for Winogrande to 0.99 for HellaSwag and GSM8K. Additionally, strong correlations between short- and long-form scores (<em>r</em> ≈ 0.90) provided evidence of concurrent validity. 
Finally, model size (number of parameters) was a slightly stronger predictor of overall LLM performance for the short-forms compared to the long-forms, indicating that the short forms exhibited comparable, if not slightly superior, convergent validity. It is concluded that shorter benchmarks may accelerate AI development by enabling more efficient evaluations. Additionally, research into the nature of intelligence may be facilitated by benchmark short-forms by enabling direct comparisons between AI and human performance.</div></div>\",\"PeriodicalId\":13862,\"journal\":{\"name\":\"Intelligence\",\"volume\":\"110 \",\"pages\":\"Article 101922\"},\"PeriodicalIF\":3.3000,\"publicationDate\":\"2025-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Intelligence\",\"FirstCategoryId\":\"102\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S016028962500025X\",\"RegionNum\":2,\"RegionCategory\":\"心理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"PSYCHOLOGY, MULTIDISCIPLINARY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Intelligence","FirstCategoryId":"102","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S016028962500025X","RegionNum":2,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"PSYCHOLOGY, MULTIDISCIPLINARY","Score":null,"Total":0}
Citations: 0

Abstract

Large Language Model (LLM) benchmark evaluation tests often comprise thousands of questions. Based on psychometric principles, reliable and valid benchmark tests can likely be developed with as few as 60 items, comparable to human intelligence tests, which typically include only 15 to 60 items. The establishment of shorter benchmark tests offers numerous potential benefits, including more efficient evaluation of LLMs, the practical feasibility of creating parallel forms, and the ability to directly compare LLM performance with human capabilities. Consequently, we analysed the performance of 591 LLMs across three widely recognized benchmarks—HellaSwag, Winogrande, and GSM8K—and developed short-forms (≈ 60 questions each) using psychometric principles. The short-forms exhibited high internal consistency reliability, with coefficient omega values ranging from 0.96 for Winogrande to 0.99 for HellaSwag and GSM8K. Additionally, strong correlations between short- and long-form scores (r ≈ 0.90) provided evidence of concurrent validity. Finally, model size (number of parameters) was a slightly stronger predictor of overall LLM performance for the short-forms compared to the long-forms, indicating that the short forms exhibited comparable, if not slightly superior, convergent validity. It is concluded that shorter benchmarks may accelerate AI development by enabling more efficient evaluations. Additionally, research into the nature of intelligence may be facilitated by benchmark short-forms by enabling direct comparisons between AI and human performance.
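The abstract describes the psychometric workflow only at a high level: selecting a roughly 60-item short form, estimating coefficient omega, and correlating short-form with long-form scores. As a rough, hypothetical illustration rather than the authors' method, the Python sketch below selects items by corrected item-total correlation, approximates McDonald's omega from first-principal-component loadings of the item correlation matrix (the abstract does not state the estimation procedure used), and correlates short-form with long-form scores on a synthetic 0/1 accuracy matrix. The data, item counts, and helper names are all assumptions.

```python
# Hypothetical sketch of a short-form construction workflow (not the authors' code).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 0/1 accuracy matrix: rows = LLMs, columns = benchmark items.
# 591 matches the number of models analysed; the item count is arbitrary.
n_models, n_items = 591, 1000
ability = rng.normal(size=n_models)
difficulty = rng.normal(size=n_items)
noise = rng.normal(size=(n_models, n_items))
responses = ((ability[:, None] - difficulty[None, :] + noise) > 0).astype(float)


def select_short_form(X: np.ndarray, k: int = 60) -> np.ndarray:
    """Return indices of the k items with the highest corrected item-total correlation."""
    total = X.sum(axis=1)
    disc = np.array([np.corrcoef(X[:, j], total - X[:, j])[0, 1] for j in range(X.shape[1])])
    disc = np.nan_to_num(disc, nan=-1.0)  # constant items never get selected
    return np.argsort(disc)[::-1][:k]


def coefficient_omega(X: np.ndarray) -> float:
    """Approximate McDonald's omega, using first-principal-component loadings of the
    item correlation matrix as stand-ins for single-factor loadings (an assumption)."""
    R = np.corrcoef(X, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(R)  # eigenvalues in ascending order
    loadings = np.sqrt(eigvals[-1]) * np.abs(eigvecs[:, -1])
    uniqueness = 1.0 - loadings ** 2
    return loadings.sum() ** 2 / (loadings.sum() ** 2 + uniqueness.sum())


short_idx = select_short_form(responses, k=60)
short_scores = responses[:, short_idx].mean(axis=1)  # proportion correct, short form
long_scores = responses.mean(axis=1)                 # proportion correct, full form

print(f"omega, 60-item short form: {coefficient_omega(responses[:, short_idx]):.2f}")
print(f"short-form vs. long-form r: {np.corrcoef(short_scores, long_scores)[0, 1]:.2f}")
```

Note that in this toy setup the short-form items are a subset of the long form, so part of the short-long correlation reflects item overlap; with real benchmark data the selection rule, factor model, and resulting coefficients would differ, and the sketch only shows the general shape of the analysis.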
Source journal: Intelligence (PSYCHOLOGY, MULTIDISCIPLINARY)
CiteScore: 5.80
Self-citation rate: 13.30%
Articles per year: 64
Review time: 69 days
Journal description: This unique journal in psychology is devoted to publishing original research and theoretical studies and review papers that substantially contribute to the understanding of intelligence. It provides a new source of significant papers in psychometrics, tests and measurement, and all other empirical and theoretical studies in intelligence and mental retardation.