Comparison between GPT-4 and human raters in grading pharmacy students' exam responses in Malaysia: a cross-sectional study.

Impact Factor: 3.7 | Q1 (Education, Scientific Disciplines)
Wuan Shuen Yap, Pui San Saw, Li Ling Yeap, Shaun Wen Huey Lee, Wei Jin Wong, Ronald Fook Seng Lee
{"title":"Comparison between GPT-4 and human raters in grading pharmacy students' exam responses in Malaysia: a cross-sectional study.","authors":"Wuan Shuen Yap, Pui San Saw, Li Ling Yeap, Shaun Wen Huey Lee, Wei Jin Wong, Ronald Fook Seng Lee","doi":"10.3352/jeehp.2025.22.20","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Manual grading is time-consuming and prone to inconsistencies, prompting the exploration of generative artificial intelligence tools such as GPT-4 to enhance efficiency and reliability. This study investigated GPT-4's potential in grading pharmacy students' exam responses, focusing on the impact of optimized prompts. Specifically, it evaluated the alignment between GPT-4 and human raters, assessed GPT-4's consistency over time, and determined its error rates in grading pharmacy students' exam responses.</p><p><strong>Methods: </strong>We conducted a comparative study using past exam responses graded by university-trained raters and by GPT-4. Responses were randomized before evaluation by GPT-4, accessed via a Plus account between April and September 2024. Prompt optimization was performed on 16 responses, followed by evaluation of 3 prompt delivery methods. We then applied the optimized approach across 4 item types. Intraclass correlation coefficients and error analyses were used to assess consistency and agreement between GPT-4 and human ratings.</p><p><strong>Results: </strong>GPT-4's ratings aligned reasonably well with human raters, demonstrating moderate to excellent reliability (intraclass correlation coefficient=0.617-0.933), depending on item type and the optimized prompt. When stratified by grade bands, GPT-4 was less consistent in marking high-scoring responses (Z=-5.71-4.62, P<0.001). Overall, despite achieving substantial alignment with human raters in many cases, discrepancies across item types and a tendency to commit basic errors necessitate continued educator involvement to ensure grading accuracy.</p><p><strong>Conclusion: </strong>With optimized prompts, GPT-4 shows promise as a supportive tool for grading pharmacy students' exam responses, particularly for objective tasks. However, its limitations-including errors and variability in grading high-scoring responses-require ongoing human oversight. Future research should explore advanced generative artificial intelligence models and broader assessment formats to further enhance grading reliability.</p>","PeriodicalId":46098,"journal":{"name":"Journal of Educational Evaluation for Health Professions","volume":"22 ","pages":"20"},"PeriodicalIF":3.7000,"publicationDate":"2025-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Educational Evaluation for Health Professions","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3352/jeehp.2025.22.20","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/7/28 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Citations: 0

Abstract

Purpose: Manual grading is time-consuming and prone to inconsistencies, prompting the exploration of generative artificial intelligence tools such as GPT-4 to enhance efficiency and reliability. This study investigated GPT-4's potential in grading pharmacy students' exam responses, focusing on the impact of optimized prompts. Specifically, it evaluated the alignment between GPT-4 and human raters, assessed GPT-4's consistency over time, and determined its error rates in grading pharmacy students' exam responses.

Methods: We conducted a comparative study using past exam responses graded by university-trained raters and by GPT-4. Responses were randomized before evaluation by GPT-4, accessed via a Plus account between April and September 2024. Prompt optimization was performed on 16 responses, followed by evaluation of 3 prompt delivery methods. We then applied the optimized approach across 4 item types. Intraclass correlation coefficients and error analyses were used to assess consistency and agreement between GPT-4 and human ratings.
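The abstract does not include the authors' analysis code, but the agreement statistic named in the Methods can be illustrated with a minimal sketch: a two-way, single-rater intraclass correlation computed over a response-by-rater score matrix. The helper function and the six hypothetical scores below are illustrative assumptions, not the study's data or implementation.

```python
import numpy as np

def icc_two_way(scores: np.ndarray) -> tuple[float, float]:
    """Return (ICC consistency, ICC absolute agreement), single-rater form,
    for an n_responses x n_raters score matrix (two-way ANOVA decomposition)."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-response means
    col_means = scores.mean(axis=0)   # per-rater means

    ss_total = ((scores - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    icc_consistency = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    icc_agreement = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    return icc_consistency, icc_agreement

# Hypothetical scores for 6 responses: column 0 = human rater, column 1 = GPT-4.
scores = np.array([[8, 7], [5, 5], [9, 8], [3, 4], [7, 7], [6, 5]], dtype=float)
icc_c, icc_a = icc_two_way(scores)
print(f"ICC consistency: {icc_c:.3f}, ICC absolute agreement: {icc_a:.3f}")
```

In practice, a statistics package such as pingouin (its intraclass_corr function) would also report the confidence intervals needed to support reliability claims like those in the Results.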

Results: GPT-4's ratings aligned reasonably well with human raters, demonstrating moderate to excellent reliability (intraclass correlation coefficient=0.617-0.933), depending on item type and the optimized prompt. When stratified by grade bands, GPT-4 was less consistent in marking high-scoring responses (Z=-5.71-4.62, P<0.001). Overall, despite achieving substantial alignment with human raters in many cases, discrepancies across item types and a tendency to commit basic errors necessitate continued educator involvement to ensure grading accuracy.

Conclusion: With optimized prompts, GPT-4 shows promise as a supportive tool for grading pharmacy students' exam responses, particularly for objective tasks. However, its limitations, including errors and variability in grading high-scoring responses, require ongoing human oversight. Future research should explore advanced generative artificial intelligence models and broader assessment formats to further enhance grading reliability.

Source Journal
CiteScore: 9.60
Self-citation rate: 9.10%
Articles published: 32
Review time: 5 weeks
Journal Introduction

Journal of Educational Evaluation for Health Professions aims to provide readers with state-of-the-art practical information on educational evaluation for the health professions, in order to improve the quality of undergraduate, graduate, and continuing education. It specializes in educational evaluation, including the application of measurement theory to medical and health education, the promotion of high-stakes examinations such as national licensing examinations, the improvement of national and international education programs, computer-based testing, computerized adaptive testing, and medical and health regulatory bodies. Its scope covers a variety of professions concerned with public health, including but not limited to: care workers, dental hygienists, dental technicians, dentists, dietitians, emergency medical technicians, health educators, medical record technicians, medical technologists, midwives, nurses, nursing aides, occupational therapists, opticians, Oriental medical doctors, Oriental medicine dispensers, Oriental pharmacists, pharmacists, physical therapists, physicians, prosthetists and orthotists, radiological technologists, rehabilitation counselors, sanitary technicians, and speech-language therapists.