Harnessing Generative AI for Assessment Item Development: Comparing AI-Generated and Human-Authored Items

IF 2.4 | Zone 4 (Management) | JCR Q3 (MANAGEMENT)

International Journal of Selection and Assessment Pub Date : 2025-08-27 DOI:10.1111/ijsa.70021

Jaclyn Martin Kowal, Kenzie Hurley Bryant, Dan Segall, Tracy Kantrowitz

Volume 33, Issue 3

Citations: 0

Abstract

The use of generative AI, specifically large language models (LLMs), in test development presents an innovative approach to efficiently creating technical, knowledge-based assessment items. This study evaluates the efficacy of AI-generated items compared to human-authored counterparts within the context of employee selection testing, focusing on data science knowledge areas. Through a paired comparison approach, subject matter experts (SMEs) were asked to evaluate items produced by both LLMs and human item writers. Findings revealed a significant preference for LLM-generated items, particularly in specific knowledge domains such as Statistical Foundations and Scientific Data Analysis. However, despite the promise of generative AI in accelerating item development, human review remains critical. Issues such as multiple correct answers or ineffective distractors in AI-generated items necessitate thorough SME review and revision to ensure quality and validity. The study highlights the potential of integrating AI with human expertise to enhance the efficiency of item generation while maintaining psychometric standards in high-stakes environments. The implications for psychometric practice and the necessity of domain-specific validation are discussed, offering a framework for future research and application of AI in test development.

Source journal

International Journal of Selection and Assessment Multiple-

CiteScore

4.10

Self-citation rate

31.80%

Publication count

Journal description: The International Journal of Selection and Assessment publishes original articles related to all aspects of personnel selection, staffing, and assessment in organizations. Combining academic research with professional-led best practice, IJSA aims to develop new knowledge and understanding in these important areas of work psychology and contemporary workforce management.