Evaluating large language models as graders of medical short answer questions: a comparative analysis with expert human graders

Olena Bolgova, Paul Ganguly, Muhammad Faisal Ikram, Volodymyr Mavrych

Medical Education Online, Vol. 30, No. 1, 2550751 (2025). Published online 24 August 2025.
DOI: 10.1080/10872981.2025.2550751
Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12377152/pdf/
Citations: 0
Abstract
The assessment of short-answer questions (SAQs) in medical education is resource-intensive, requiring significant expert time. Large Language Models (LLMs) offer potential for automating this process, but their efficacy in specialized medical education assessment remains understudied. This study evaluated the capability of five LLMs to grade medical SAQs, compared with expert human graders, across four distinct medical disciplines. A total of 804 student responses in anatomy, histology, embryology, and physiology were analyzed. Three faculty members graded all responses. Five LLMs (GPT-4.1, Gemini, Claude, Copilot, DeepSeek) evaluated each response twice: first using their learned representations to generate their own grading criteria (A1), then using expert-provided rubrics (A2). Agreement was measured using Cohen's Kappa and the Intraclass Correlation Coefficient (ICC). Expert-expert agreement was substantial across all questions (average Kappa: 0.69, ICC: 0.86), ranging from moderate (SAQ2: 0.57) to almost perfect (SAQ4: 0.87). LLM performance varied dramatically by question type and model. The highest expert-LLM agreement was observed for Claude on SAQ3 (Kappa: 0.61) and DeepSeek on SAQ2 (Kappa: 0.53). Providing expert criteria had inconsistent effects, significantly improving agreement for some model-question combinations while decreasing it for others. No single LLM consistently outperformed the others across all domains. LLM strictness in grading unsatisfactory responses differed substantially from that of the experts. Overall, LLMs demonstrated domain-specific variations in grading capability, and the provision of expert criteria did not consistently improve performance. While LLMs show promise for supporting medical education assessment, their implementation requires domain-specific considerations and continued human oversight.
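For readers unfamiliar with the agreement statistics named in the abstract: Cohen's Kappa corrects the observed agreement p_o for the agreement expected by chance p_e, kappa = (p_o - p_e) / (1 - p_e), while the ICC measures consistency of ratings treated as continuous scores. The following is a minimal sketch of how such statistics can be computed for one expert-LLM grader pair; the 0-3 grade scale, the toy data, and all variable names are illustrative assumptions, not taken from the study.

```python
# Sketch only: toy data on an assumed 0-3 grade scale, not the paper's dataset.
import pandas as pd
from sklearn.metrics import cohen_kappa_score
import pingouin as pg

# Hypothetical integer grades awarded to the same 8 student responses
# by a human expert and by one LLM grader.
expert_grades = [3, 2, 0, 1, 3, 2, 1, 0]
llm_grades    = [3, 2, 1, 1, 2, 2, 1, 0]

# Cohen's Kappa: chance-corrected categorical agreement between the two graders.
kappa = cohen_kappa_score(expert_grades, llm_grades)
print(f"Cohen's Kappa: {kappa:.2f}")

# ICC: agreement on the grades treated as numeric ratings. pingouin expects
# long-format data with one row per (response, rater) pair.
long = pd.DataFrame({
    "response": list(range(8)) * 2,
    "rater":    ["expert"] * 8 + ["llm"] * 8,
    "grade":    expert_grades + llm_grades,
})
icc = pg.intraclass_corr(data=long, targets="response",
                         raters="rater", ratings="grade")
print(icc[["Type", "ICC"]])
```

In practice, which ICC variant to report (e.g. single-rater consistency vs. average-measure absolute agreement) depends on the study design; the table returned by pingouin lists all standard forms.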
About the journal:
Medical Education Online is an open access journal of health care education, publishing peer-reviewed research, perspectives, reviews, and early documentation of new ideas and trends.
Medical Education Online aims to disseminate information on the education and training of physicians and other health care professionals. Manuscripts may address any aspect of health care education and training, including, but not limited to:
- Basic science education
- Clinical science education
- Residency education
- Learning theory
- Problem-based learning (PBL)
- Curriculum development
- Research design and statistics
- Measurement and evaluation
- Faculty development
- Informatics/web