{"title":"心理测量学衍生的60个问题基准:实质性的效率和人类与人工智能比较的可能性","authors":"Gilles E. Gignac , David Ilić","doi":"10.1016/j.intell.2025.101922","DOIUrl":null,"url":null,"abstract":"<div><div>Large Language Model (LLM) benchmark evaluation tests often comprise thousands of questions. Based on psychometric principles, reliable and valid benchmark tests can likely be developed with as few as 60 items, comparable to human intelligence tests, which typically include only 15 to 60 items. The establishment of shorter benchmark tests offers numerous potential benefits, including more efficient evaluation of LLMs, the practical feasibility of creating parallel forms, and the ability to directly compare LLM performance with human capabilities. Consequently, we analysed the performance of 591 LLMs across three widely recognized benchmarks—HellaSwag, Winogrande, and GSM8K—and developed short-forms (≈ 60 questions each) using psychometric principles. The short-forms exhibited high internal consistency reliability, with coefficient omega values ranging from 0.96 for Winogrande to 0.99 for HellaSwag and GSM8K. Additionally, strong correlations between short- and long-form scores (<em>r</em> ≈ 0.90) provided evidence of concurrent validity. Finally, model size (number of parameters) was a slightly stronger predictor of overall LLM performance for the short-forms compared to the long-forms, indicating that the short forms exhibited comparable, if not slightly superior, convergent validity. It is concluded that shorter benchmarks may accelerate AI development by enabling more efficient evaluations. Additionally, research into the nature of intelligence may be facilitated by benchmark short-forms by enabling direct comparisons between AI and human performance.</div></div>","PeriodicalId":13862,"journal":{"name":"Intelligence","volume":"110 ","pages":"Article 101922"},"PeriodicalIF":3.3000,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Psychometrically derived 60-question benchmarks: Substantial efficiencies and the possibility of human-AI comparisons\",\"authors\":\"Gilles E. Gignac , David Ilić\",\"doi\":\"10.1016/j.intell.2025.101922\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Large Language Model (LLM) benchmark evaluation tests often comprise thousands of questions. Based on psychometric principles, reliable and valid benchmark tests can likely be developed with as few as 60 items, comparable to human intelligence tests, which typically include only 15 to 60 items. The establishment of shorter benchmark tests offers numerous potential benefits, including more efficient evaluation of LLMs, the practical feasibility of creating parallel forms, and the ability to directly compare LLM performance with human capabilities. Consequently, we analysed the performance of 591 LLMs across three widely recognized benchmarks—HellaSwag, Winogrande, and GSM8K—and developed short-forms (≈ 60 questions each) using psychometric principles. The short-forms exhibited high internal consistency reliability, with coefficient omega values ranging from 0.96 for Winogrande to 0.99 for HellaSwag and GSM8K. Additionally, strong correlations between short- and long-form scores (<em>r</em> ≈ 0.90) provided evidence of concurrent validity. 
Finally, model size (number of parameters) was a slightly stronger predictor of overall LLM performance for the short-forms compared to the long-forms, indicating that the short forms exhibited comparable, if not slightly superior, convergent validity. It is concluded that shorter benchmarks may accelerate AI development by enabling more efficient evaluations. Additionally, research into the nature of intelligence may be facilitated by benchmark short-forms by enabling direct comparisons between AI and human performance.</div></div>\",\"PeriodicalId\":13862,\"journal\":{\"name\":\"Intelligence\",\"volume\":\"110 \",\"pages\":\"Article 101922\"},\"PeriodicalIF\":3.3000,\"publicationDate\":\"2025-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Intelligence\",\"FirstCategoryId\":\"102\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S016028962500025X\",\"RegionNum\":2,\"RegionCategory\":\"心理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"PSYCHOLOGY, MULTIDISCIPLINARY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Intelligence","FirstCategoryId":"102","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S016028962500025X","RegionNum":2,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"PSYCHOLOGY, MULTIDISCIPLINARY","Score":null,"Total":0}
Psychometrically derived 60-question benchmarks: Substantial efficiencies and the possibility of human-AI comparisons
Large Language Model (LLM) benchmark evaluation tests often comprise thousands of questions. Based on psychometric principles, reliable and valid benchmark tests can likely be developed with as few as 60 items, comparable to human intelligence tests, which typically include only 15 to 60 items. The establishment of shorter benchmark tests offers numerous potential benefits, including more efficient evaluation of LLMs, the practical feasibility of creating parallel forms, and the ability to directly compare LLM performance with human capabilities. Consequently, we analysed the performance of 591 LLMs across three widely recognized benchmarks (HellaSwag, Winogrande, and GSM8K) and developed short-forms (≈ 60 questions each) using psychometric principles. The short-forms exhibited high internal consistency reliability, with coefficient omega values ranging from 0.96 for Winogrande to 0.99 for HellaSwag and GSM8K. Additionally, strong correlations between short- and long-form scores (r ≈ 0.90) provided evidence of concurrent validity. Finally, model size (number of parameters) was a slightly stronger predictor of overall LLM performance for the short-forms than for the long-forms, indicating that the short-forms exhibited comparable, if not slightly superior, convergent validity. It is concluded that shorter benchmarks may accelerate AI development by enabling more efficient evaluations. Additionally, research into the nature of intelligence may be facilitated by benchmark short-forms, which enable direct comparisons between AI and human performance.
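To make the psychometric workflow described in the abstract concrete, the sketch below illustrates, under simplifying assumptions and without access to the authors' data or code, how a roughly 60-item short form might be drawn from an item-level accuracy matrix (models × items, scored 0/1), how McDonald's omega could be approximated from a one-factor principal-axis model, and how concurrent validity could be checked by correlating short- and long-form scores. The selection rule (corrected item-total correlations), the simulated data, and all function names are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of a short-form workflow, assuming an items-by-models 0/1
# accuracy matrix. All names, the item-selection heuristic, and the simulated
# data are illustrative assumptions, not the authors' method or code.
import numpy as np

def select_short_form(scores: np.ndarray, n_items: int = 60) -> np.ndarray:
    """Return indices of the n_items with the highest corrected item-total
    correlations (a simple, common short-form selection heuristic)."""
    total = scores.sum(axis=1)
    r_it = np.empty(scores.shape[1])
    for j in range(scores.shape[1]):
        rest = total - scores[:, j]               # total score excluding item j
        r_it[j] = np.corrcoef(scores[:, j], rest)[0, 1]
    return np.argsort(r_it)[-n_items:]

def omega_total(scores: np.ndarray, n_iter: int = 100) -> float:
    """Approximate McDonald's omega from a single-factor model fitted by
    iterated principal-axis factoring on the Pearson (phi) correlation
    matrix -- an approximation when items are binary."""
    R = np.corrcoef(scores, rowvar=False)
    comm = 1.0 - 1.0 / np.diag(np.linalg.inv(R))  # SMC starting communalities
    for _ in range(n_iter):
        Rr = R.copy()
        np.fill_diagonal(Rr, comm)                # reduced correlation matrix
        vals, vecs = np.linalg.eigh(Rr)           # eigenvalues in ascending order
        loadings = np.abs(vecs[:, -1]) * np.sqrt(max(vals[-1], 0.0))
        comm = np.clip(loadings ** 2, 0.0, 0.999) # guard against Heywood cases
    uniq = 1.0 - loadings ** 2
    return loadings.sum() ** 2 / (loadings.sum() ** 2 + uniq.sum())

# Simulated stand-in for 591 models answering a 1000-item long-form benchmark.
rng = np.random.default_rng(0)
ability = rng.normal(size=(591, 1))               # latent "model ability"
difficulty = rng.normal(size=(1, 1000))           # item difficulties
long_form = (ability - difficulty + rng.normal(size=(591, 1000)) > 0).astype(float)

short_idx = select_short_form(long_form, n_items=60)
short_form = long_form[:, short_idx]

print("omega (short form):", round(omega_total(short_form), 3))
print("short-long r:", round(np.corrcoef(short_form.mean(1), long_form.mean(1))[0, 1], 3))
```

With real benchmark data, the 591 × n_items accuracy matrix would replace the simulated long_form array; the reliability and validity figures reported in the abstract (omega of 0.96 to 0.99, r ≈ 0.90) indicate the level of performance such a short form would need to reproduce.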
Journal description:
This unique journal in psychology is devoted to publishing original research, theoretical studies, and review papers that substantially contribute to the understanding of intelligence. It provides a new source of significant papers in psychometrics, tests and measurement, and all other empirical and theoretical studies in intelligence and mental retardation.