All Your Base Are Belong to Us: The Urgent Reality of Unproctored Testing in the Age of LLMs

Impact factor 2.4; Zone 4 (Management); JCR Q3 (Management)

International Journal of Selection and Assessment. Pub Date: 2025-03-04. DOI: 10.1111/ijsa.70005

Louis Hickman

Volume 33, Issue 2. Full text: https://onlinelibrary.wiley.com/doi/10.1111/ijsa.70005


Abstract

The release of new generative artificial intelligence (AI) tools, including new large language models (LLMs), continues at a rapid pace. Upon the release of OpenAI's new o1 models, I reconducted Hickman et al.'s (2024) analyses examining how well LLMs perform on a quantitative ability (number series) test. GPT-4 scored below the 20th percentile (compared to thousands of human test takers), but o1 scored at the 95th percentile. In response to these updated findings and Lievens and Dunlop's (2025) article about the effects of LLMs on the validity of pre-employment assessments, I make an urgent call to action for selection and assessment researchers and practitioners. A recent survey suggests that a large proportion of applicants are already using generative AI tools to complete high-stakes assessments, and it seems that no current assessments will be safe for long. Thus, I offer possibilities for the future of testing, detail their benefits and drawbacks, and provide recommendations. These possibilities are: increased use of proctoring, adding strict time limits, using LLM detection software, using think-aloud (or similar) protocols, collecting and analyzing trace data, emphasizing samples over signs, and redesigning assessments to allow LLM use during completion. Several of these possibilities inspire future research to modernize assessment. Future research should seek to improve our understanding of how to design valid assessments that allow LLM use, how to effectively use trace test-taker data, and whether think-aloud protocols can help differentiate experts and novices.


Source journal

International Journal of Selection and Assessment Multiple-

CiteScore: 4.10

Self-citation rate: 31.80%

Journal description: The International Journal of Selection and Assessment publishes original articles related to all aspects of personnel selection, staffing, and assessment in organizations. Using an effective combination of academic research with professional-led best practice, IJSA aims to develop new knowledge and understanding in these important areas of work psychology and contemporary workforce management.