The performance of large language models on quantitative and verbal ability tests: Initial evidence and implications for unproctored high-stakes testing

Impact Factor 2.6 | CAS Tier 4 (Management) | JCR Q3 (MANAGEMENT)
Louis Hickman, Patrick D. Dunlop, Jasper Leo Wolf
{"title":"The performance of large language models on quantitative and verbal ability tests: Initial evidence and implications for unproctored high-stakes testing","authors":"Louis Hickman,&nbsp;Patrick D. Dunlop,&nbsp;Jasper Leo Wolf","doi":"10.1111/ijsa.12479","DOIUrl":null,"url":null,"abstract":"<p>Unproctored assessments are widely used in pre-employment assessment. However, widely accessible large language models (LLMs) pose challenges for unproctored personnel assessments, given that applicants may use them to artificially inflate their scores beyond their true abilities. This may be particularly concerning in cognitive ability tests, which are widely used and traditionally considered to be less fakeable by humans than personality tests. Thus, this study compares the performance of LLMs on two common types of cognitive tests: quantitative ability (number series completion) and verbal ability (use a passage of text to determine whether a statement is true). The tests investigated are used in real-world, high-stakes selection. We also examine the performance of the LLMs across different test formats (i.e., open-ended vs. multiple choice). Further, we contrast the performance of two LLMs (Generative Pretrained Transformers, GPT-3.5 and GPT-4) across multiple prompt approaches and “temperature” settings (i.e., a parameter that determines the amount of randomness in the model's output). We found that the LLMs performed well on the verbal ability test but extremely poorly on the quantitative ability test, even when accounting for the test format. GPT-4 outperformed GPT-3.5 across both types of tests. Notably, although prompt approaches and temperature settings did affect LLM test performance, those effects were mostly minor relative to differences across tests and language models. We provide recommendations for securing pre-employment testing against LLM influences. Additionally, we call for rigorous research investigating the prevalence of LLM usage in pre-employment testing as well as on how LLM usage affects selection test validity.</p>","PeriodicalId":51465,"journal":{"name":"International Journal of Selection and Assessment","volume":"32 4","pages":"499-511"},"PeriodicalIF":2.6000,"publicationDate":"2024-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/ijsa.12479","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Selection and Assessment","FirstCategoryId":"91","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/ijsa.12479","RegionNum":4,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"MANAGEMENT","Score":null,"Total":0}
Citations: 0

Abstract

Unproctored assessments are widely used in pre-employment assessment. However, widely accessible large language models (LLMs) pose challenges for unproctored personnel assessments, given that applicants may use them to artificially inflate their scores beyond their true abilities. This may be particularly concerning in cognitive ability tests, which are widely used and traditionally considered to be less fakeable by humans than personality tests. Thus, this study compares the performance of LLMs on two common types of cognitive tests: quantitative ability (number series completion) and verbal ability (using a passage of text to determine whether a statement is true). The tests investigated are used in real-world, high-stakes selection. We also examine the performance of the LLMs across different test formats (i.e., open-ended vs. multiple choice). Further, we contrast the performance of two LLMs (Generative Pretrained Transformers, GPT-3.5 and GPT-4) across multiple prompt approaches and “temperature” settings (i.e., a parameter that determines the amount of randomness in the model's output). We found that the LLMs performed well on the verbal ability test but extremely poorly on the quantitative ability test, even when accounting for the test format. GPT-4 outperformed GPT-3.5 across both types of tests. Notably, although prompt approaches and temperature settings did affect LLM test performance, those effects were mostly minor relative to differences across tests and language models. We provide recommendations for securing pre-employment testing against LLM influences. Additionally, we call for rigorous research investigating the prevalence of LLM usage in pre-employment testing as well as how LLM usage affects selection test validity.
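The abstract's reference to prompt approaches and temperature settings can be made concrete with a short sketch. The Python snippet below is an illustration only, assuming the OpenAI Python client; the model names, temperature values, and the number-series item are placeholders rather than the study's actual prompts or proprietary test materials. It shows how one might query two GPT models at different temperature settings on a quantitative-ability-style item.

# Illustrative sketch (not the authors' materials): querying LLMs with a
# number-series item across temperature settings via the OpenAI Python client.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical item in the spirit of a number-series completion task.
ITEM = "Complete the series: 2, 6, 12, 20, 30, ?  Reply with the number only."

for model in ["gpt-3.5-turbo", "gpt-4"]:
    for temperature in [0.0, 1.0]:  # temperature controls output randomness
        response = client.chat.completions.create(
            model=model,
            temperature=temperature,
            messages=[{"role": "user", "content": ITEM}],
        )
        answer = response.choices[0].message.content.strip()
        print(f"{model} (temperature={temperature}): {answer}")

In this placeholder item, consecutive differences grow by two (4, 6, 8, 10), so the expected answer is 42; higher temperatures make the model's output less deterministic, which is why the study varied this parameter alongside prompt approach.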


Source journal: International Journal of Selection and Assessment
CiteScore: 4.10
Self-citation rate: 31.80%
Articles published: 46
Journal description: The International Journal of Selection and Assessment publishes original articles related to all aspects of personnel selection, staffing, and assessment in organizations. Using an effective combination of academic research with professional-led best practice, IJSA aims to develop new knowledge and understanding in these important areas of work psychology and contemporary workforce management.