The answer may vary: large language model response patterns challenge their use in test item analysis.

IF 3.3 | Zone 2 (Education) | JCR Q1 (Education, Scientific Disciplines)

Medical Teacher | Pub Date: 2025-11-01 | Epub Date: 2025-05-04 | DOI: 10.1080/0142159X.2025.2497891

Lauren K Buhl

Medical Teacher, pages 1761-1766. Publication type: Journal Article.

Citations: 0

Abstract

Introduction: The validation of multiple-choice question (MCQ)-based assessments typically requires administration to a test population, which is resource-intensive and practically demanding. Large language models (LLMs) are a promising tool to aid in many aspects of assessment development, including the challenge of determining the psychometric properties of test items. This study investigated whether LLMs could predict the difficulty and point biserial indices of MCQs, potentially alleviating the need for preliminary analysis in a test population.

Methods: Sixty MCQs developed by subject matter experts in anesthesiology were presented one hundred times each to five different LLMs (ChatGPT-4o, o1-preview, Claude 3.5 Sonnet, Grok-2, and Llama 3.2) and to clinical fellows. Response patterns were analyzed, and difficulty indices (proportion of correct responses) and point biserial indices (item-test score correlation) were calculated. Spearman correlation coefficients were used to compare difficulty and point biserial indices between the LLMs and fellows.
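The item statistics named in the Methods can be sketched in a few lines of Python. This is a minimal illustration, not the study's analysis code: the function names (`pearson`, `average_ranks`, `spearman`, `item_statistics`) and the toy score matrix are hypothetical. The point biserial here uses the uncorrected total score (the item itself counts toward the total), which is an assumption, since the abstract does not say which variant was used.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation; returns NaN when either input has zero variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def item_statistics(scores):
    """scores[e][i] is 1 if examinee (or trial) e got item i right, else 0.

    Returns (difficulty, point_biserial), one value per item.
    Assumption: point biserial = correlation of item score with the
    uncorrected total test score.
    """
    n_items = len(scores[0])
    totals = [sum(row) for row in scores]
    difficulty = [sum(row[i] for row in scores) / len(scores) for i in range(n_items)]
    point_biserial = [
        pearson([row[i] for row in scores], totals) for i in range(n_items)
    ]
    return difficulty, point_biserial


# Made-up data: 4 examinees x 3 items (not from the study).
toy_scores = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]
toy_difficulty, toy_point_biserial = item_statistics(toy_scores)
```

Given one difficulty list per group (for example, one from an LLM's trials and one from fellows), `spearman(llm_difficulty, fellow_difficulty)` would give the kind of rank correlation the Methods describe.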

Results: Marked differences in response patterns were observed among LLMs: ChatGPT-4o, o1-preview, and Grok-2 showed variable responses across trials, while Claude 3.5 Sonnet and Llama 3.2 gave consistent responses. The LLMs outperformed fellows, with mean scores of 58% to 85% compared to 57% for the fellows. Three LLMs showed a weak correlation with fellow difficulty indices (r = 0.28-0.29), while the two highest-scoring models showed no correlation. No LLM predicted the point biserial indices.
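One way to summarize how variable or consistent a model's answers are across repeated presentations of a single item is sketched below. The abstract does not state how response patterns were quantified, so the modal-share measure and the function name `response_profile` are assumptions for illustration, and the response lists are made up.

```python
from collections import Counter


def response_profile(responses, answer_key):
    """Summarize repeated answers to one item.

    responses: list of chosen options, one per trial (e.g. 100 trials).
    answer_key: the correct option.
    """
    counts = Counter(responses)
    modal_choice, modal_count = counts.most_common(1)[0]
    n_trials = len(responses)
    return {
        # Proportion of trials answered correctly (difficulty index).
        "difficulty": counts[answer_key] / n_trials,
        # Most frequent answer and the share of trials that chose it.
        "modal_choice": modal_choice,
        "modal_share": modal_count / n_trials,
        # Number of different options chosen at least once.
        "distinct_choices": len(counts),
    }


# Hypothetical variable responder: answers spread over three options.
variable = response_profile(["A"] * 70 + ["B"] * 20 + ["C"] * 10, answer_key="A")

# Hypothetical consistent responder: same (wrong) answer on every trial.
consistent = response_profile(["B"] * 100, answer_key="A")
```

Under this sketch, a consistent model yields a difficulty index of exactly 0 or 1 for every item, which leaves little spread to correlate against human difficulty indices.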

Discussion: These findings suggest LLMs have limited utility in predicting MCQ performance metrics. Notably, higher-scoring models showed less correlation with human performance, suggesting that as models become more powerful, their ability to predict human performance may decrease. Understanding the consistency of an LLM's response pattern is critical for both research methodology and practical applications in test development. Future work should focus on leveraging the language-processing capabilities of LLMs for overall assessment optimization (e.g., inter-item correlation) rather than predicting item characteristics.


Source journal

Medical Teacher (Medicine: Health Care)

CiteScore: 7.80

Self-citation rate: 8.50%

Articles published: 396

Review time: 3-6 weeks

About the journal: Medical Teacher provides accounts of new teaching methods, guidance on structuring courses and assessing achievement, and serves as a forum for communication between medical teachers and those involved in general education. In particular, the journal recognizes the problems teachers have in keeping up-to-date with the developments in educational methods that lead to more effective teaching and learning at a time when the content of the curriculum, from medical procedures to policy changes in health care provision, is also changing. The journal features reports of innovation and research in medical education, case studies, survey articles, practical guidelines, reviews of current literature and book reviews. All articles are peer reviewed.