Leveraging LLM respondents for item evaluation: A psychometric analysis

IF 6.7 | CAS Tier 1 (Education) | Q1 EDUCATION & EDUCATIONAL RESEARCH
Yunting Liu, Shreya Bhandari, Zachary A. Pardos
{"title":"Leveraging LLM respondents for item evaluation: A psychometric analysis","authors":"Yunting Liu,&nbsp;Shreya Bhandari,&nbsp;Zachary A. Pardos","doi":"10.1111/bjet.13570","DOIUrl":null,"url":null,"abstract":"<div>\n \n <section>\n \n <p>Effective educational measurement relies heavily on the curation of well-designed item pools. However, item calibration is time consuming and costly, requiring a sufficient number of respondents to estimate the psychometric properties of items. In this study, we explore the potential of six different large language models (LLMs; GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro and Cohere Command R Plus) to generate responses with psychometric properties comparable to those of human respondents. Results indicate that some LLMs exhibit proficiency in College Algebra that is similar to or exceeds that of college students. However, we find the LLMs used in this study to have narrow proficiency distributions, limiting their ability to fully mimic the variability observed in human respondents, but that an ensemble of LLMs can better approximate the broader ability distribution typical of college students. Utilizing item response theory, the item parameters calibrated by LLM respondents have high correlations (eg, &gt;0.8 for GPT-3.5) with their human calibrated counterparts. Several augmentation strategies are evaluated for their relative performance, with resampling methods proving most effective, enhancing the Spearman correlation from 0.89 (human only) to 0.93 (augmented human).</p>\n </section>\n \n <section>\n \n <div>\n \n <div>\n \n <h3>Practitioner notes</h3>\n <p>What is already known about this topic\n </p><ul>\n \n <li>The collection of human responses to candidate test items is common practice in educational measurement when designing an assessment tool.</li>\n \n <li>Large language models (LLMs) have been found to rival human abilities in a variety of subject areas, making them a low-cost option for testing the efficacy of educational assessment items.</li>\n \n <li>Data augmentation using AI has been an effective strategy for enhancing machine learning model performance.</li>\n </ul>\n \n <p>What this paper adds\n </p><ul>\n \n <li>This paper provides the first psychometric analysis of the ability distribution of a variety of open-source and proprietary LLMs as compared to humans.</li>\n \n <li>The study finds that item parameters similar to those produced by 50 undergraduate respondents.</li>\n \n <li>Using LLM respondents to augment human response data yields mixed results.</li>\n </ul>\n \n <p>Implications for practice and/or policy\n </p><ul>\n \n <li>The moderate performance of LLM respondents by themselves suggests that they could provide a low-cost option for curating quality items for low-stakes formative or summative assessments.</li>\n \n <li>This methodology offers a scalable way to evaluate vast amounts of generative AI-produced items.</li>\n </ul>\n \n </div>\n </div>\n </section>\n </div>","PeriodicalId":48315,"journal":{"name":"British Journal of Educational Technology","volume":"56 3","pages":"1028-1052"},"PeriodicalIF":6.7000,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/bjet.13570","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"British Journal of Educational 
Technology","FirstCategoryId":"95","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/bjet.13570","RegionNum":1,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Citations: 0

Abstract

Effective educational measurement relies heavily on the curation of well-designed item pools. However, item calibration is time-consuming and costly, requiring a sufficient number of respondents to estimate the psychometric properties of items. In this study, we explore the potential of six different large language models (LLMs; GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro and Cohere Command R Plus) to generate responses with psychometric properties comparable to those of human respondents. Results indicate that some LLMs exhibit proficiency in College Algebra that is similar to or exceeds that of college students. However, we find that the LLMs used in this study have narrow proficiency distributions, limiting their ability to fully mimic the variability observed in human respondents, but that an ensemble of LLMs can better approximate the broader ability distribution typical of college students. Under item response theory, the item parameters calibrated from LLM respondents correlate highly (e.g., >0.8 for GPT-3.5) with their human-calibrated counterparts. Several augmentation strategies are evaluated for their relative performance, with resampling methods proving most effective, enhancing the Spearman correlation from 0.89 (human only) to 0.93 (augmented human).
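The pipeline described in the abstract can be illustrated with a small simulation. The Python sketch below is not the authors' code or data: it fits a Rasch (1PL) model by joint maximum likelihood to simulated human and LLM response matrices, compares the resulting item difficulties against a large-sample reference calibration using Spearman correlations, and then applies a simple bootstrap-resampling augmentation of a small human sample with LLM respondents. All sample sizes, ability distributions, the 1PL simplification of the paper's IRT models, and the particular resampling scheme are assumptions made for illustration only.

"""
Illustrative sketch (not the authors' code or data): simulate Rasch
response matrices for human and LLM "respondents", calibrate item
difficulties from each, compare them with a Spearman correlation, and
try a simple bootstrap-resampling augmentation of a small human sample.
Sample sizes, ability distributions and the 1PL (Rasch) simplification
are assumptions.
"""
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
N_ITEMS = 20
TRUE_B = rng.normal(0.0, 1.0, size=N_ITEMS)        # latent item difficulties


def simulate_responses(n_respondents, ability_mean, ability_sd):
    """Binary (respondent x item) matrix generated under a Rasch model."""
    theta = rng.normal(ability_mean, ability_sd, size=n_respondents)
    p_correct = expit(theta[:, None] - TRUE_B[None, :])
    return (rng.uniform(size=p_correct.shape) < p_correct).astype(int)


def calibrate_rasch(responses):
    """Joint maximum-likelihood Rasch calibration; returns centred item difficulties."""
    n_resp, n_item = responses.shape

    def neg_log_lik(params):
        theta, b = params[:n_resp], params[n_resp:]
        logits = theta[:, None] - b[None, :]
        p = expit(logits)
        # Bernoulli log-likelihood of the observed 0/1 responses
        ll = responses * logits - np.logaddexp(0.0, logits)
        resid = responses - p
        grad = np.concatenate([-resid.sum(axis=1), resid.sum(axis=0)])
        return -ll.sum(), grad

    x0 = np.zeros(n_resp + n_item)
    res = minimize(neg_log_lik, x0, method="L-BFGS-B", jac=True)
    b_hat = res.x[n_resp:]
    return b_hat - b_hat.mean()                     # centre for identifiability


# A large simulated human pool stands in for a "gold standard" calibration.
reference = calibrate_rasch(simulate_responses(2000, 0.0, 1.0))

# Humans: broad ability spread.  LLM respondents: higher mean, narrow spread,
# echoing the abstract's finding of narrow LLM proficiency distributions.
small_human = simulate_responses(50, 0.0, 1.0)
llm_pool = simulate_responses(60, 0.5, 0.3)

b_human = calibrate_rasch(small_human)
b_llm = calibrate_rasch(llm_pool)

# Resampling augmentation: bootstrap LLM respondents and stack them
# under the human rows before recalibrating.
boot_rows = rng.integers(0, llm_pool.shape[0], size=50)
b_augmented = calibrate_rasch(np.vstack([small_human, llm_pool[boot_rows]]))

for label, b in [("LLM only", b_llm), ("human only", b_human), ("augmented", b_augmented)]:
    rho, _ = spearmanr(reference, b)
    print(f"{label:11s} vs. reference calibration: Spearman rho = {rho:.2f}")

The joint maximum-likelihood Rasch fit is used here only to keep the sketch self-contained; the paper's actual calibration and augmentation procedures should be taken from the full text.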

Practitioner notes

What is already known about this topic

  • The collection of human responses to candidate test items is common practice in educational measurement when designing an assessment tool.
  • Large language models (LLMs) have been found to rival human abilities in a variety of subject areas, making them a low-cost option for testing the efficacy of educational assessment items.
  • Data augmentation using AI has been an effective strategy for enhancing machine learning model performance.

What this paper adds

  • This paper provides the first psychometric analysis of the ability distribution of a variety of open-source and proprietary LLMs as compared to humans.
  • The study finds that LLM respondents yield item parameters similar to those produced by 50 undergraduate respondents.
  • Using LLM respondents to augment human response data yields mixed results.

Implications for practice and/or policy

  • The moderate performance of LLM respondents by themselves suggests that they could provide a low-cost option for curating quality items for low-stakes formative or summative assessments.
  • This methodology offers a scalable way to evaluate vast amounts of generative AI-produced items.


Source journal
British Journal of Educational Technology
CiteScore: 15.60
Self-citation rate: 4.50%
Articles published: 111

Journal description: BJET is a primary source for academics and professionals in the fields of digital educational and training technology throughout the world. The Journal is published by Wiley on behalf of The British Educational Research Association (BERA). It publishes theoretical perspectives, methodological developments and high-quality empirical research that demonstrate whether and how applications of instructional/educational technology systems, networks, tools and resources lead to improvements in formal and non-formal education at all levels, from early years through to higher, technical and vocational education, professional development and corporate training.