Arvind Rajan, Seth McKenzie Alexander, Christina L Shenvi
{"title":"人工智能能像教授一样评分吗?比较人工智能与医学生临床推理简答考试教师评分。","authors":"Arvind Rajan, Seth McKenzie Alexander, Christina L Shenvi","doi":"10.1007/s10459-025-10462-3","DOIUrl":null,"url":null,"abstract":"<p><p>Many medical schools primarily use multiple-choice questions (MCQs) in pre-clinical assessments due to their efficiency and consistency. However, while MCQs are easy to grade, they often fall short in evaluating higher-order reasoning and understanding student thought processes. Despite these limitations, MCQs remain popular because alternative assessments require more time and resources to grade. This study explored whether OpenAI's GPT-4o Large Language Model (LLM) could be used to effectively grade narrative short answer questions (SAQs) in case-based learning (CBL) exams when compared to faculty graders. The primary outcome was equivalence of LLM grading, assessed using a bootstrapping procedure to calculate 95% confidence intervals (CIs) for mean score differences. Equivalence was defined as the entire 95% CI falling within a ± 5% margin. Secondary outcomes included grading precision, subgroup analysis by Bloom's taxonomy, and correlation between question complexity and LLM performance. Analysis of 1,450 responses showed LLM scores were equivalent to faculty scores overall (mean difference: -0.55%, 95% CI: -1.53%, + 0.45%). Equivalence was also demonstrated for Remembering, Applying, and Analyzing questions, however, discrepancies were observed for Understanding and Evaluating questions. AI grading demonstrated high precision (ICC = 0.993, 95% CI: 0.992-0.994). Greater differences between LLM and faculty scores were found for more difficult questions (R2 = 0.6199, p < 0.0001). LLM grading could serve as a tool for preliminary scoring of student assessments, enhancing SAQ grading efficiency and improving undergraduate medical education examination quality. Secondary outcome findings emphasize the need to use these tools in combination with, not as a replacement for, faculty involvement in the grading process.</p>","PeriodicalId":50959,"journal":{"name":"Advances in Health Sciences Education","volume":" ","pages":""},"PeriodicalIF":3.3000,"publicationDate":"2025-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Can AI grade like a professor? comparing artificial intelligence and faculty scoring of medical student short-answer clinical reasoning exams.\",\"authors\":\"Arvind Rajan, Seth McKenzie Alexander, Christina L Shenvi\",\"doi\":\"10.1007/s10459-025-10462-3\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p>Many medical schools primarily use multiple-choice questions (MCQs) in pre-clinical assessments due to their efficiency and consistency. However, while MCQs are easy to grade, they often fall short in evaluating higher-order reasoning and understanding student thought processes. Despite these limitations, MCQs remain popular because alternative assessments require more time and resources to grade. This study explored whether OpenAI's GPT-4o Large Language Model (LLM) could be used to effectively grade narrative short answer questions (SAQs) in case-based learning (CBL) exams when compared to faculty graders. The primary outcome was equivalence of LLM grading, assessed using a bootstrapping procedure to calculate 95% confidence intervals (CIs) for mean score differences. Equivalence was defined as the entire 95% CI falling within a ± 5% margin. 
Secondary outcomes included grading precision, subgroup analysis by Bloom's taxonomy, and correlation between question complexity and LLM performance. Analysis of 1,450 responses showed LLM scores were equivalent to faculty scores overall (mean difference: -0.55%, 95% CI: -1.53%, + 0.45%). Equivalence was also demonstrated for Remembering, Applying, and Analyzing questions, however, discrepancies were observed for Understanding and Evaluating questions. AI grading demonstrated high precision (ICC = 0.993, 95% CI: 0.992-0.994). Greater differences between LLM and faculty scores were found for more difficult questions (R2 = 0.6199, p < 0.0001). LLM grading could serve as a tool for preliminary scoring of student assessments, enhancing SAQ grading efficiency and improving undergraduate medical education examination quality. Secondary outcome findings emphasize the need to use these tools in combination with, not as a replacement for, faculty involvement in the grading process.</p>\",\"PeriodicalId\":50959,\"journal\":{\"name\":\"Advances in Health Sciences Education\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":3.3000,\"publicationDate\":\"2025-08-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Advances in Health Sciences Education\",\"FirstCategoryId\":\"95\",\"ListUrlMain\":\"https://doi.org/10.1007/s10459-025-10462-3\",\"RegionNum\":2,\"RegionCategory\":\"教育学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"EDUCATION & EDUCATIONAL RESEARCH\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Advances in Health Sciences Education","FirstCategoryId":"95","ListUrlMain":"https://doi.org/10.1007/s10459-025-10462-3","RegionNum":2,"RegionCategory":"教育学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION & EDUCATIONAL RESEARCH","Score":null,"Total":0}
Can AI grade like a professor? Comparing artificial intelligence and faculty scoring of medical student short-answer clinical reasoning exams.
Many medical schools primarily use multiple-choice questions (MCQs) in pre-clinical assessments due to their efficiency and consistency. However, while MCQs are easy to grade, they often fall short in evaluating higher-order reasoning and understanding student thought processes. Despite these limitations, MCQs remain popular because alternative assessments require more time and resources to grade. This study explored whether OpenAI's GPT-4o Large Language Model (LLM) could effectively grade narrative short-answer questions (SAQs) in case-based learning (CBL) exams when compared to faculty graders. The primary outcome was equivalence of LLM grading, assessed using a bootstrapping procedure to calculate 95% confidence intervals (CIs) for mean score differences. Equivalence was defined as the entire 95% CI falling within a ±5% margin. Secondary outcomes included grading precision, subgroup analysis by Bloom's taxonomy, and the correlation between question complexity and LLM performance. Analysis of 1,450 responses showed that LLM scores were equivalent to faculty scores overall (mean difference: -0.55%; 95% CI: -1.53% to +0.45%). Equivalence was also demonstrated for Remembering, Applying, and Analyzing questions; however, discrepancies were observed for Understanding and Evaluating questions. AI grading demonstrated high precision (ICC = 0.993, 95% CI: 0.992-0.994). Greater differences between LLM and faculty scores were found for more difficult questions (R² = 0.6199, p < 0.0001). LLM grading could serve as a tool for preliminary scoring of student assessments, enhancing SAQ grading efficiency and improving undergraduate medical education examination quality. The secondary outcome findings emphasize the need to use these tools in combination with, not as a replacement for, faculty involvement in the grading process.
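The bootstrap equivalence analysis described in the abstract can be illustrated with a short sketch: resample the paired (LLM, faculty) score differences, compute a bootstrap 95% CI for the mean difference, and declare equivalence if the whole interval lies inside the ±5% margin. This is a minimal illustration assuming percentage-scale scores; the function name, parameters, and synthetic data are hypothetical and are not the authors' actual analysis code.

# Minimal sketch of a bootstrap equivalence check (not the study's code):
# resample paired score differences, build a 95% CI for the mean difference,
# and test whether the entire CI falls within a ±5% equivalence margin.
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_equivalence(llm_scores, faculty_scores,
                          n_boot=10_000, margin=5.0, alpha=0.05):
    """Return (mean_diff, ci_low, ci_high, equivalent) for percentage scores."""
    diffs = np.asarray(llm_scores, dtype=float) - np.asarray(faculty_scores, dtype=float)
    boot_means = np.empty(n_boot)
    for i in range(n_boot):
        # Resample the paired differences with replacement and record the mean.
        sample = rng.choice(diffs, size=diffs.size, replace=True)
        boot_means[i] = sample.mean()
    ci_low, ci_high = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    equivalent = (ci_low > -margin) and (ci_high < margin)
    return diffs.mean(), ci_low, ci_high, equivalent

# Synthetic example: 1,450 paired percentage scores with a small negative offset.
faculty = rng.uniform(50, 100, size=1450)
llm = faculty + rng.normal(-0.5, 8.0, size=1450)
print(bootstrap_equivalence(llm, faculty))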
Journal introduction:
Advances in Health Sciences Education is a forum for scholarly and state-of-the-art research into all aspects of health sciences education. It publishes empirical studies as well as discussions of theoretical issues and practical implications. The primary focus of the Journal is linking theory to practice; thus, priority is given to papers that have a sound theoretical basis and strong methodology.