{"title":"心理测量学衍生的60个问题基准:实质性的效率和人类与人工智能比较的可能性","authors":"Gilles E. Gignac , David Ilić","doi":"10.1016/j.intell.2025.101922","DOIUrl":null,"url":null,"abstract":"<div><div>Large Language Model (LLM) benchmark evaluation tests often comprise thousands of questions. Based on psychometric principles, reliable and valid benchmark tests can likely be developed with as few as 60 items, comparable to human intelligence tests, which typically include only 15 to 60 items. The establishment of shorter benchmark tests offers numerous potential benefits, including more efficient evaluation of LLMs, the practical feasibility of creating parallel forms, and the ability to directly compare LLM performance with human capabilities. Consequently, we analysed the performance of 591 LLMs across three widely recognized benchmarks—HellaSwag, Winogrande, and GSM8K—and developed short-forms (≈ 60 questions each) using psychometric principles. The short-forms exhibited high internal consistency reliability, with coefficient omega values ranging from 0.96 for Winogrande to 0.99 for HellaSwag and GSM8K. Additionally, strong correlations between short- and long-form scores (<em>r</em> ≈ 0.90) provided evidence of concurrent validity. Finally, model size (number of parameters) was a slightly stronger predictor of overall LLM performance for the short-forms compared to the long-forms, indicating that the short forms exhibited comparable, if not slightly superior, convergent validity. It is concluded that shorter benchmarks may accelerate AI development by enabling more efficient evaluations. Additionally, research into the nature of intelligence may be facilitated by benchmark short-forms by enabling direct comparisons between AI and human performance.</div></div>","PeriodicalId":13862,"journal":{"name":"Intelligence","volume":"110 ","pages":"Article 101922"},"PeriodicalIF":3.3000,"publicationDate":"2025-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Psychometrically derived 60-question benchmarks: Substantial efficiencies and the possibility of human-AI comparisons\",\"authors\":\"Gilles E. Gignac , David Ilić\",\"doi\":\"10.1016/j.intell.2025.101922\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Large Language Model (LLM) benchmark evaluation tests often comprise thousands of questions. Based on psychometric principles, reliable and valid benchmark tests can likely be developed with as few as 60 items, comparable to human intelligence tests, which typically include only 15 to 60 items. The establishment of shorter benchmark tests offers numerous potential benefits, including more efficient evaluation of LLMs, the practical feasibility of creating parallel forms, and the ability to directly compare LLM performance with human capabilities. Consequently, we analysed the performance of 591 LLMs across three widely recognized benchmarks—HellaSwag, Winogrande, and GSM8K—and developed short-forms (≈ 60 questions each) using psychometric principles. The short-forms exhibited high internal consistency reliability, with coefficient omega values ranging from 0.96 for Winogrande to 0.99 for HellaSwag and GSM8K. Additionally, strong correlations between short- and long-form scores (<em>r</em> ≈ 0.90) provided evidence of concurrent validity. 
Finally, model size (number of parameters) was a slightly stronger predictor of overall LLM performance for the short-forms compared to the long-forms, indicating that the short forms exhibited comparable, if not slightly superior, convergent validity. It is concluded that shorter benchmarks may accelerate AI development by enabling more efficient evaluations. Additionally, research into the nature of intelligence may be facilitated by benchmark short-forms by enabling direct comparisons between AI and human performance.</div></div>\",\"PeriodicalId\":13862,\"journal\":{\"name\":\"Intelligence\",\"volume\":\"110 \",\"pages\":\"Article 101922\"},\"PeriodicalIF\":3.3000,\"publicationDate\":\"2025-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Intelligence\",\"FirstCategoryId\":\"102\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S016028962500025X\",\"RegionNum\":2,\"RegionCategory\":\"心理学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"PSYCHOLOGY, MULTIDISCIPLINARY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Intelligence","FirstCategoryId":"102","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S016028962500025X","RegionNum":2,"RegionCategory":"心理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"PSYCHOLOGY, MULTIDISCIPLINARY","Score":null,"Total":0}
Psychometrically derived 60-question benchmarks: Substantial efficiencies and the possibility of human-AI comparisons
Large Language Model (LLM) benchmark evaluation tests often comprise thousands of questions. Based on psychometric principles, reliable and valid benchmark tests can likely be developed with as few as 60 items, comparable to human intelligence tests, which typically include only 15 to 60 items. The establishment of shorter benchmark tests offers numerous potential benefits, including more efficient evaluation of LLMs, the practical feasibility of creating parallel forms, and the ability to directly compare LLM performance with human capabilities. Consequently, we analysed the performance of 591 LLMs across three widely recognized benchmarks (HellaSwag, Winogrande, and GSM8K) and developed short-forms (≈ 60 questions each) using psychometric principles. The short-forms exhibited high internal consistency reliability, with coefficient omega values ranging from 0.96 for Winogrande to 0.99 for HellaSwag and GSM8K. Additionally, strong correlations between short- and long-form scores (r ≈ 0.90) provided evidence of concurrent validity. Finally, model size (number of parameters) was a slightly stronger predictor of overall LLM performance for the short-forms than for the long-forms, indicating that the short-forms exhibited comparable, if not slightly superior, convergent validity. It is concluded that shorter benchmarks may accelerate AI development by enabling more efficient evaluations. Additionally, research into the nature of intelligence may be facilitated by benchmark short-forms, which enable direct comparisons between AI and human performance.
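To make the psychometric workflow described in the abstract concrete, the sketch below illustrates, under simplifying assumptions and without access to the authors' data or code, how a roughly 60-item short form might be drawn from an item-level accuracy matrix (models × items, scored 0/1), how McDonald's omega could be approximated from a one-factor principal-axis model, and how concurrent validity could be checked by correlating short- and long-form scores. The selection rule (corrected item-total correlations), the simulated data, and all function names are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch of a short-form workflow, assuming an items-by-models 0/1
# accuracy matrix. All names, the item-selection heuristic, and the simulated
# data are illustrative assumptions, not the authors' method or code.
import numpy as np

def select_short_form(scores: np.ndarray, n_items: int = 60) -> np.ndarray:
    """Return indices of the n_items with the highest corrected item-total
    correlations (a simple, common short-form selection heuristic)."""
    total = scores.sum(axis=1)
    r_it = np.empty(scores.shape[1])
    for j in range(scores.shape[1]):
        rest = total - scores[:, j]               # total score excluding item j
        r_it[j] = np.corrcoef(scores[:, j], rest)[0, 1]
    return np.argsort(r_it)[-n_items:]

def omega_total(scores: np.ndarray, n_iter: int = 100) -> float:
    """Approximate McDonald's omega from a single-factor model fitted by
    iterated principal-axis factoring on the Pearson (phi) correlation
    matrix -- an approximation when items are binary."""
    R = np.corrcoef(scores, rowvar=False)
    comm = 1.0 - 1.0 / np.diag(np.linalg.inv(R))  # SMC starting communalities
    for _ in range(n_iter):
        Rr = R.copy()
        np.fill_diagonal(Rr, comm)                # reduced correlation matrix
        vals, vecs = np.linalg.eigh(Rr)           # eigenvalues in ascending order
        loadings = np.abs(vecs[:, -1]) * np.sqrt(max(vals[-1], 0.0))
        comm = np.clip(loadings ** 2, 0.0, 0.999) # guard against Heywood cases
    uniq = 1.0 - loadings ** 2
    return loadings.sum() ** 2 / (loadings.sum() ** 2 + uniq.sum())

# Simulated stand-in for 591 models answering a 1000-item long-form benchmark.
rng = np.random.default_rng(0)
ability = rng.normal(size=(591, 1))               # latent "model ability"
difficulty = rng.normal(size=(1, 1000))           # item difficulties
long_form = (ability - difficulty + rng.normal(size=(591, 1000)) > 0).astype(float)

short_idx = select_short_form(long_form, n_items=60)
short_form = long_form[:, short_idx]

print("omega (short form):", round(omega_total(short_form), 3))
print("short-long r:", round(np.corrcoef(short_form.mean(1), long_form.mean(1))[0, 1], 3))
```

With real benchmark data, the 591 × n_items accuracy matrix would replace the simulated long_form array; the reliability and validity figures reported in the abstract (omega of 0.96 to 0.99, r ≈ 0.90) indicate the level of performance such a short form would need to reproduce.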
Journal description:
This unique journal in psychology is devoted to publishing original research, theoretical studies, and review papers that substantially contribute to the understanding of intelligence. It provides a new source of significant papers in psychometrics, tests and measurement, and all other empirical and theoretical studies in intelligence and mental retardation.