Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders.

IF 3.8 | CAS Zone 2 (Medicine) | Q1 EDUCATION & EDUCATIONAL RESEARCH
Medical Education Online | Pub Date: 2025-12-01 | Epub Date: 2025-08-24 | DOI: 10.1080/10872981.2025.2550751
Olena Bolgova, Paul Ganguly, Muhammad Faisal Ikram, Volodymyr Mavrych
{"title":"评估大型语言模型作为医学简答题评分者:与专家评分者的比较分析。","authors":"Olena Bolgova, Paul Ganguly, Muhammad Faisal Ikram, Volodymyr Mavrych","doi":"10.1080/10872981.2025.2550751","DOIUrl":null,"url":null,"abstract":"<p><p>The assessment of short-answer questions (SAQs) in medical education is resource-intensive, requiring significant expert time. Large Language Models (LLMs) offer potential for automating this process, but their efficacy in specialized medical education assessment remains understudied. To evaluate the capability of five LLMs to grade medical SAQs compared to expert human graders across four distinct medical disciplines. This study analyzed 804 student responses across anatomy, histology, embryology, and physiology. Three faculty members graded all responses. Five LLMs (GPT-4.1, Gemini, Claude, Copilot, DeepSeek) evaluated responses twice: first using their learned representations to generate their own grading criteria (A1), then using expert-provided rubrics (A2). Agreement was measured using Cohen's Kappa and Intraclass Correlation Coefficient (ICC). Expert-expert agreement was substantial across all questions (average Kappa: 0.69, ICC: 0.86), ranging from moderate (SAQ2: 0.57) to almost perfect (SAQ4: 0.87). LLM performance varied dramatically by question type and model. The highest expert-LLM agreement was observed for Claude on SAQ3 (Kappa: 0.61) and DeepSeek on SAQ2 (Kappa: 0.53). Providing expert criteria had inconsistent effects, significantly improving some model-question combinations while decreasing others. No single LLM consistently outperformed others across all domains. LLM strictness in grading unsatisfactory responses varied substantially from experts. LLMs demonstrated domain-specific variations in grading capabilities. The provision of expert criteria did not consistently improve performance. While LLMs show promise for supporting medical education assessment, their implementation requires domain-specific considerations and continued human oversight.</p>","PeriodicalId":47656,"journal":{"name":"Medical Education Online","volume":"30 1","pages":"2550751"},"PeriodicalIF":3.8000,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12377152/pdf/","citationCount":"0","resultStr":"{\"title\":\"Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders.\",\"authors\":\"Olena Bolgova, Paul Ganguly, Muhammad Faisal Ikram, Volodymyr Mavrych\",\"doi\":\"10.1080/10872981.2025.2550751\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>The assessment of short-answer questions (SAQs) in medical education is resource-intensive, requiring significant expert time. Large Language Models (LLMs) offer potential for automating this process, but their efficacy in specialized medical education assessment remains understudied. To evaluate the capability of five LLMs to grade medical SAQs compared to expert human graders across four distinct medical disciplines. This study analyzed 804 student responses across anatomy, histology, embryology, and physiology. Three faculty members graded all responses. Five LLMs (GPT-4.1, Gemini, Claude, Copilot, DeepSeek) evaluated responses twice: first using their learned representations to generate their own grading criteria (A1), then using expert-provided rubrics (A2). Agreement was measured using Cohen's Kappa and Intraclass Correlation Coefficient (ICC). 
Expert-expert agreement was substantial across all questions (average Kappa: 0.69, ICC: 0.86), ranging from moderate (SAQ2: 0.57) to almost perfect (SAQ4: 0.87). LLM performance varied dramatically by question type and model. The highest expert-LLM agreement was observed for Claude on SAQ3 (Kappa: 0.61) and DeepSeek on SAQ2 (Kappa: 0.53). Providing expert criteria had inconsistent effects, significantly improving some model-question combinations while decreasing others. No single LLM consistently outperformed others across all domains. LLM strictness in grading unsatisfactory responses varied substantially from experts. LLMs demonstrated domain-specific variations in grading capabilities. The provision of expert criteria did not consistently improve performance. While LLMs show promise for supporting medical education assessment, their implementation requires domain-specific considerations and continued human oversight.</p>\",\"PeriodicalId\":47656,\"journal\":{\"name\":\"Medical Education Online\",\"volume\":\"30 1\",\"pages\":\"2550751\"},\"PeriodicalIF\":3.8000,\"publicationDate\":\"2025-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12377152/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Medical Education Online\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1080/10872981.2025.2550751\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/8/24 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q1\",\"JCRName\":\"EDUCATION & EDUCATIONAL RESEARCH\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medical Education Online","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1080/10872981.2025.2550751","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/8/24 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Citations: 0

Abstract


The assessment of short-answer questions (SAQs) in medical education is resource-intensive, requiring significant expert time. Large Language Models (LLMs) offer potential for automating this process, but their efficacy in specialized medical education assessment remains understudied. To evaluate the capability of five LLMs to grade medical SAQs compared to expert human graders across four distinct medical disciplines. This study analyzed 804 student responses across anatomy, histology, embryology, and physiology. Three faculty members graded all responses. Five LLMs (GPT-4.1, Gemini, Claude, Copilot, DeepSeek) evaluated responses twice: first using their learned representations to generate their own grading criteria (A1), then using expert-provided rubrics (A2). Agreement was measured using Cohen's Kappa and Intraclass Correlation Coefficient (ICC). Expert-expert agreement was substantial across all questions (average Kappa: 0.69, ICC: 0.86), ranging from moderate (SAQ2: 0.57) to almost perfect (SAQ4: 0.87). LLM performance varied dramatically by question type and model. The highest expert-LLM agreement was observed for Claude on SAQ3 (Kappa: 0.61) and DeepSeek on SAQ2 (Kappa: 0.53). Providing expert criteria had inconsistent effects, significantly improving some model-question combinations while decreasing others. No single LLM consistently outperformed others across all domains. LLM strictness in grading unsatisfactory responses varied substantially from experts. LLMs demonstrated domain-specific variations in grading capabilities. The provision of expert criteria did not consistently improve performance. While LLMs show promise for supporting medical education assessment, their implementation requires domain-specific considerations and continued human oversight.
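The abstract reports rater agreement with Cohen's Kappa and the Intraclass Correlation Coefficient (ICC). The snippet below is a minimal sketch, not taken from the study, of how expert-versus-LLM agreement could be computed for a single question: the grades, the 0-3 scale, and the use of the scikit-learn and pingouin libraries are illustrative assumptions, and since the abstract does not state which ICC variant was used, the sketch simply prints all variants pingouin returns.

```python
# Hedged sketch: quantify agreement between an expert grader and an LLM grader
# for one SAQ, using Cohen's Kappa and ICC. Requires scikit-learn, pandas, pingouin.
import pandas as pd
from sklearn.metrics import cohen_kappa_score
import pingouin as pg

# Hypothetical grades (0-3 scale) for ten student responses to one question.
expert_grades = [3, 2, 0, 1, 3, 2, 2, 1, 0, 3]
llm_grades    = [3, 2, 1, 1, 3, 1, 2, 1, 0, 3]

# Cohen's Kappa: chance-corrected agreement between two raters.
# Common interpretation (Landis & Koch): 0.41-0.60 moderate,
# 0.61-0.80 substantial, 0.81-1.00 almost perfect.
kappa = cohen_kappa_score(expert_grades, llm_grades)
print(f"Cohen's Kappa: {kappa:.2f}")

# ICC: reshape to long format (one row per response-rater pair), then compute
# all ICC variants with pingouin; the study's choice of variant is not given.
long = pd.DataFrame({
    "response": list(range(len(expert_grades))) * 2,
    "rater": ["expert"] * len(expert_grades) + ["llm"] * len(llm_grades),
    "grade": expert_grades + llm_grades,
})
icc = pg.intraclass_corr(data=long, targets="response", raters="rater", ratings="grade")
print(icc[["Type", "ICC"]])
```

In the study's setting, a computation like this would be repeated per question and per model (and per grading condition A1 versus A2) over the 804 responses, which is how per-question Kappa values such as 0.61 for Claude on SAQ3 could be obtained.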

Source Journal
Medical Education Online (EDUCATION & EDUCATIONAL RESEARCH)
CiteScore: 6.00
Self-citation rate: 2.20%
Articles published: 97
Review time: 8 weeks
Journal description: Medical Education Online is an open access journal of health care education, publishing peer-reviewed research, perspectives, reviews, and early documentation of new ideas and trends. Medical Education Online aims to disseminate information on the education and training of physicians and other health care professionals. Manuscripts may address any aspect of health care education and training, including, but not limited to:
- Basic science education
- Clinical science education
- Residency education
- Learning theory
- Problem-based learning (PBL)
- Curriculum development
- Research design and statistics
- Measurement and evaluation
- Faculty development
- Informatics/web