{"title":"Leveraging LLM respondents for item evaluation: A psychometric analysis","authors":"Yunting Liu, Shreya Bhandari, Zachary A. Pardos","doi":"10.1111/bjet.13570","DOIUrl":null,"url":null,"abstract":"<div>\n \n <section>\n \n <p>Effective educational measurement relies heavily on the curation of well-designed item pools. However, item calibration is time consuming and costly, requiring a sufficient number of respondents to estimate the psychometric properties of items. In this study, we explore the potential of six different large language models (LLMs; GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro and Cohere Command R Plus) to generate responses with psychometric properties comparable to those of human respondents. Results indicate that some LLMs exhibit proficiency in College Algebra that is similar to or exceeds that of college students. However, we find the LLMs used in this study to have narrow proficiency distributions, limiting their ability to fully mimic the variability observed in human respondents, but that an ensemble of LLMs can better approximate the broader ability distribution typical of college students. Utilizing item response theory, the item parameters calibrated by LLM respondents have high correlations (eg, >0.8 for GPT-3.5) with their human calibrated counterparts. Several augmentation strategies are evaluated for their relative performance, with resampling methods proving most effective, enhancing the Spearman correlation from 0.89 (human only) to 0.93 (augmented human).</p>\n </section>\n \n <section>\n \n <div>\n \n <div>\n \n <h3>Practitioner notes</h3>\n <p>What is already known about this topic\n </p><ul>\n \n <li>The collection of human responses to candidate test items is common practice in educational measurement when designing an assessment tool.</li>\n \n <li>Large language models (LLMs) have been found to rival human abilities in a variety of subject areas, making them a low-cost option for testing the efficacy of educational assessment items.</li>\n \n <li>Data augmentation using AI has been an effective strategy for enhancing machine learning model performance.</li>\n </ul>\n \n <p>What this paper adds\n </p><ul>\n \n <li>This paper provides the first psychometric analysis of the ability distribution of a variety of open-source and proprietary LLMs as compared to humans.</li>\n \n <li>The study finds that item parameters similar to those produced by 50 undergraduate respondents.</li>\n \n <li>Using LLM respondents to augment human response data yields mixed results.</li>\n </ul>\n \n <p>Implications for practice and/or policy\n </p><ul>\n \n <li>The moderate performance of LLM respondents by themselves suggests that they could provide a low-cost option for curating quality items for low-stakes formative or summative assessments.</li>\n \n <li>This methodology offers a scalable way to evaluate vast amounts of generative AI-produced items.</li>\n </ul>\n \n </div>\n </div>\n </section>\n </div>","PeriodicalId":48315,"journal":{"name":"British Journal of Educational Technology","volume":"56 3","pages":"1028-1052"},"PeriodicalIF":6.7000,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/bjet.13570","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"British Journal of Educational 
Technology","FirstCategoryId":"95","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/bjet.13570","RegionNum":1,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Citations: 0
Abstract
Effective educational measurement relies heavily on the curation of well-designed item pools. However, item calibration is time-consuming and costly, requiring a sufficient number of respondents to estimate the psychometric properties of items. In this study, we explore the potential of six different large language models (LLMs; GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro and Cohere Command R Plus) to generate responses with psychometric properties comparable to those of human respondents. Results indicate that some LLMs exhibit proficiency in College Algebra that is similar to or exceeds that of college students. However, we find that the LLMs used in this study have narrow proficiency distributions, limiting their ability to fully mimic the variability observed in human respondents, but that an ensemble of LLMs can better approximate the broader ability distribution typical of college students. Utilizing item response theory, the item parameters calibrated by LLM respondents have high correlations (eg, >0.8 for GPT-3.5) with their human-calibrated counterparts. Several augmentation strategies are evaluated for their relative performance, with resampling methods proving most effective, enhancing the Spearman correlation from 0.89 (human only) to 0.93 (augmented human).
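The workflow described in the abstract amounts to calibrating item parameters separately from human and LLM response matrices and correlating the two sets of estimates. The sketch below illustrates that idea under stated assumptions, not the authors' actual pipeline: it fits a simple Rasch (1PL) model by penalised joint maximum likelihood with NumPy/SciPy rather than the paper's IRT tooling, and it uses synthetic response matrices (a broad "human" ability distribution and a narrower, higher-mean "LLM" pool, echoing the abstract) in place of the real data. All names and numbers in the code are illustrative.

```python
"""Illustrative sketch only: a Rasch (1PL) stand-in for the paper's IRT calibration,
run on synthetic data. Nothing here reproduces the authors' actual pipeline."""
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import spearmanr

N_ITEMS = 20  # hypothetical item pool size


def calibrate_rasch(responses: np.ndarray) -> np.ndarray:
    """Estimate item difficulties from a (respondents x items) 0/1 matrix
    by penalised joint maximum likelihood under the Rasch model."""
    n_persons, n_items = responses.shape

    def neg_log_lik(params):
        theta = params[:n_persons]            # person abilities
        b = params[n_persons:]                # item difficulties
        p = expit(theta[:, None] - b[None, :])
        ll = responses * np.log(p + 1e-9) + (1 - responses) * np.log(1 - p + 1e-9)
        # A small ridge penalty keeps estimates finite for perfect scores
        # and pins down the location of the latent scale.
        return -ll.sum() + 0.01 * (theta ** 2).sum() + 0.01 * (b ** 2).sum()

    fit = minimize(neg_log_lik, np.zeros(n_persons + n_items), method="L-BFGS-B")
    return fit.x[n_persons:]


rng = np.random.default_rng(0)
true_b = rng.normal(0.0, 1.0, size=N_ITEMS)  # "true" item difficulties for simulation


def simulate_pool(n_respondents, ability_mean, ability_sd):
    """Generate dichotomous responses for a pool with the given ability distribution."""
    theta = rng.normal(ability_mean, ability_sd, size=n_respondents)
    p_correct = expit(theta[:, None] - true_b[None, :])
    return (rng.random((n_respondents, N_ITEMS)) < p_correct).astype(int)


human = simulate_pool(50, ability_mean=0.0, ability_sd=1.0)  # broad, human-like spread
llm = simulate_pool(50, ability_mean=0.8, ability_sd=0.3)    # narrow, higher-proficiency pool

b_human = calibrate_rasch(human)
b_llm = calibrate_rasch(llm)
b_augmented = calibrate_rasch(np.vstack([human, llm]))

rho_llm, _ = spearmanr(b_human, b_llm)
rho_aug, _ = spearmanr(b_human, b_augmented)
print(f"LLM-calibrated vs human-calibrated difficulties: rho = {rho_llm:.2f}")
print(f"Augmented vs human-only calibration:             rho = {rho_aug:.2f}")
```

The printed correlations depend entirely on the synthetic settings above; they illustrate the comparison the abstract quantifies (correlations above 0.8 for LLM-calibrated parameters, and 0.89 rising to 0.93 with resampling-based augmentation) but are not the paper's figures.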
Practitioner notes
What is already known about this topic
The collection of human responses to candidate test items is common practice in educational measurement when designing an assessment tool.
Large language models (LLMs) have been found to rival human abilities in a variety of subject areas, making them a low-cost option for testing the efficacy of educational assessment items.
Data augmentation using AI has been an effective strategy for enhancing machine learning model performance.
What this paper adds
This paper provides the first psychometric analysis of the ability distribution of a variety of open-source and proprietary LLMs as compared to humans.
The study finds that LLM respondents yield item parameters similar to those produced by 50 undergraduate respondents.
Using LLM respondents to augment human response data yields mixed results.
Implications for practice and/or policy
The moderate performance of LLM respondents by themselves suggests that they could provide a low-cost option for curating quality items for low-stakes formative or summative assessments.
This methodology offers a scalable way to evaluate vast amounts of generative AI-produced items.
Journal introduction:
BJET is a primary source for academics and professionals in the fields of digital educational and training technology throughout the world. The Journal is published by Wiley on behalf of The British Educational Research Association (BERA). It publishes theoretical perspectives, methodological developments and high quality empirical research that demonstrate whether and how applications of instructional/educational technology systems, networks, tools and resources lead to improvements in formal and non-formal education at all levels, from early years through to higher, technical and vocational education, professional development and corporate training.