Leveraging LLM respondents for item evaluation: A psychometric analysis

IF 6.7 | CAS Tier 1 (Education) | Q1 EDUCATION & EDUCATIONAL RESEARCH
Yunting Liu, Shreya Bhandari, Zachary A. Pardos
{"title":"Leveraging LLM respondents for item evaluation: A psychometric analysis","authors":"Yunting Liu,&nbsp;Shreya Bhandari,&nbsp;Zachary A. Pardos","doi":"10.1111/bjet.13570","DOIUrl":null,"url":null,"abstract":"<div>\n \n <section>\n \n <p>Effective educational measurement relies heavily on the curation of well-designed item pools. However, item calibration is time consuming and costly, requiring a sufficient number of respondents to estimate the psychometric properties of items. In this study, we explore the potential of six different large language models (LLMs; GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro and Cohere Command R Plus) to generate responses with psychometric properties comparable to those of human respondents. Results indicate that some LLMs exhibit proficiency in College Algebra that is similar to or exceeds that of college students. However, we find the LLMs used in this study to have narrow proficiency distributions, limiting their ability to fully mimic the variability observed in human respondents, but that an ensemble of LLMs can better approximate the broader ability distribution typical of college students. Utilizing item response theory, the item parameters calibrated by LLM respondents have high correlations (eg, &gt;0.8 for GPT-3.5) with their human calibrated counterparts. Several augmentation strategies are evaluated for their relative performance, with resampling methods proving most effective, enhancing the Spearman correlation from 0.89 (human only) to 0.93 (augmented human).</p>\n </section>\n \n <section>\n \n <div>\n \n <div>\n \n <h3>Practitioner notes</h3>\n <p>What is already known about this topic\n </p><ul>\n \n <li>The collection of human responses to candidate test items is common practice in educational measurement when designing an assessment tool.</li>\n \n <li>Large language models (LLMs) have been found to rival human abilities in a variety of subject areas, making them a low-cost option for testing the efficacy of educational assessment items.</li>\n \n <li>Data augmentation using AI has been an effective strategy for enhancing machine learning model performance.</li>\n </ul>\n \n <p>What this paper adds\n </p><ul>\n \n <li>This paper provides the first psychometric analysis of the ability distribution of a variety of open-source and proprietary LLMs as compared to humans.</li>\n \n <li>The study finds that item parameters similar to those produced by 50 undergraduate respondents.</li>\n \n <li>Using LLM respondents to augment human response data yields mixed results.</li>\n </ul>\n \n <p>Implications for practice and/or policy\n </p><ul>\n \n <li>The moderate performance of LLM respondents by themselves suggests that they could provide a low-cost option for curating quality items for low-stakes formative or summative assessments.</li>\n \n <li>This methodology offers a scalable way to evaluate vast amounts of generative AI-produced items.</li>\n </ul>\n \n </div>\n </div>\n </section>\n </div>","PeriodicalId":48315,"journal":{"name":"British Journal of Educational Technology","volume":"56 3","pages":"1028-1052"},"PeriodicalIF":6.7000,"publicationDate":"2025-02-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1111/bjet.13570","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"British Journal of Educational 
Technology","FirstCategoryId":"95","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/bjet.13570","RegionNum":1,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Citations: 0

Abstract

Effective educational measurement relies heavily on the curation of well-designed item pools. However, item calibration is time-consuming and costly, requiring a sufficient number of respondents to estimate the psychometric properties of items. In this study, we explore the potential of six different large language models (LLMs; GPT-3.5, GPT-4, Llama 2, Llama 3, Gemini-Pro and Cohere Command R Plus) to generate responses with psychometric properties comparable to those of human respondents. Results indicate that some LLMs exhibit proficiency in College Algebra that is similar to or exceeds that of college students. However, we find that the LLMs used in this study have narrow proficiency distributions, limiting their ability to fully mimic the variability observed in human respondents, but that an ensemble of LLMs can better approximate the broader ability distribution typical of college students. Under item response theory, the item parameters calibrated from LLM respondents correlate highly (e.g., >0.8 for GPT-3.5) with their human-calibrated counterparts. Several augmentation strategies are evaluated for their relative performance, with resampling methods proving most effective, enhancing the Spearman correlation from 0.89 (human only) to 0.93 (augmented human).
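The pipeline described in the abstract can be illustrated with a small simulation. The Python sketch below is not the authors' code or data: it fits a Rasch (1PL) model by joint maximum likelihood to simulated human and LLM response matrices, compares the resulting item difficulties against a large-sample reference calibration using Spearman correlations, and then applies a simple bootstrap-resampling augmentation of a small human sample with LLM respondents. All sample sizes, ability distributions, the 1PL simplification of the paper's IRT models, and the particular resampling scheme are assumptions made for illustration only.

"""
Illustrative sketch (not the authors' code or data): simulate Rasch
response matrices for human and LLM "respondents", calibrate item
difficulties from each, compare them with a Spearman correlation, and
try a simple bootstrap-resampling augmentation of a small human sample.
Sample sizes, ability distributions and the 1PL (Rasch) simplification
are assumptions.
"""
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
N_ITEMS = 20
TRUE_B = rng.normal(0.0, 1.0, size=N_ITEMS)        # latent item difficulties


def simulate_responses(n_respondents, ability_mean, ability_sd):
    """Binary (respondent x item) matrix generated under a Rasch model."""
    theta = rng.normal(ability_mean, ability_sd, size=n_respondents)
    p_correct = expit(theta[:, None] - TRUE_B[None, :])
    return (rng.uniform(size=p_correct.shape) < p_correct).astype(int)


def calibrate_rasch(responses):
    """Joint maximum-likelihood Rasch calibration; returns centred item difficulties."""
    n_resp, n_item = responses.shape

    def neg_log_lik(params):
        theta, b = params[:n_resp], params[n_resp:]
        logits = theta[:, None] - b[None, :]
        p = expit(logits)
        # Bernoulli log-likelihood of the observed 0/1 responses
        ll = responses * logits - np.logaddexp(0.0, logits)
        resid = responses - p
        grad = np.concatenate([-resid.sum(axis=1), resid.sum(axis=0)])
        return -ll.sum(), grad

    x0 = np.zeros(n_resp + n_item)
    res = minimize(neg_log_lik, x0, method="L-BFGS-B", jac=True)
    b_hat = res.x[n_resp:]
    return b_hat - b_hat.mean()                     # centre for identifiability


# A large simulated human pool stands in for a "gold standard" calibration.
reference = calibrate_rasch(simulate_responses(2000, 0.0, 1.0))

# Humans: broad ability spread.  LLM respondents: higher mean, narrow spread,
# echoing the abstract's finding of narrow LLM proficiency distributions.
small_human = simulate_responses(50, 0.0, 1.0)
llm_pool = simulate_responses(60, 0.5, 0.3)

b_human = calibrate_rasch(small_human)
b_llm = calibrate_rasch(llm_pool)

# Resampling augmentation: bootstrap LLM respondents and stack them
# under the human rows before recalibrating.
boot_rows = rng.integers(0, llm_pool.shape[0], size=50)
b_augmented = calibrate_rasch(np.vstack([small_human, llm_pool[boot_rows]]))

for label, b in [("LLM only", b_llm), ("human only", b_human), ("augmented", b_augmented)]:
    rho, _ = spearmanr(reference, b)
    print(f"{label:11s} vs. reference calibration: Spearman rho = {rho:.2f}")

The joint maximum-likelihood Rasch fit is used here only to keep the sketch self-contained; the paper's actual calibration and augmentation procedures should be taken from the full text.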

Practitioner notes

What is already known about this topic

  • The collection of human responses to candidate test items is common practice in educational measurement when designing an assessment tool.
  • Large language models (LLMs) have been found to rival human abilities in a variety of subject areas, making them a low-cost option for testing the efficacy of educational assessment items.
  • Data augmentation using AI has been an effective strategy for enhancing machine learning model performance.

What this paper adds

  • This paper provides the first psychometric analysis of the ability distribution of a variety of open-source and proprietary LLMs as compared to humans.
  • The study finds that LLM respondents yield item parameters similar to those produced by 50 undergraduate respondents.
  • Using LLM respondents to augment human response data yields mixed results.

Implications for practice and/or policy

  • The moderate performance of LLM respondents by themselves suggests that they could provide a low-cost option for curating quality items for low-stakes formative or summative assessments.
  • This methodology offers a scalable way to evaluate vast amounts of generative AI-produced items.


Source journal
British Journal of Educational Technology
CiteScore: 15.60
Self-citation rate: 4.50%
Articles published: 111

Journal description: BJET is a primary source for academics and professionals in the fields of digital educational and training technology throughout the world. The Journal is published by Wiley on behalf of The British Educational Research Association (BERA). It publishes theoretical perspectives, methodological developments and high-quality empirical research that demonstrate whether and how applications of instructional/educational technology systems, networks, tools and resources lead to improvements in formal and non-formal education at all levels, from early years through to higher, technical and vocational education, professional development and corporate training.