Comparison between GPT-4 and human raters in grading pharmacy students' exam responses in Malaysia: a cross-sectional study

Wuan Shuen Yap, Pui San Saw, Li Ling Yeap, Shaun Wen Huey Lee, Wei Jin Wong, Ronald Fook Seng Lee

Journal of Educational Evaluation for Health Professions 2025;22:20. Epub 2025 Jul 28.
DOI: 10.3352/jeehp.2025.22.20
Abstract
Purpose: Manual grading is time-consuming and prone to inconsistencies, prompting the exploration of generative artificial intelligence tools such as GPT-4 to enhance efficiency and reliability. This study investigated GPT-4's potential in grading pharmacy students' exam responses, focusing on the impact of optimized prompts. Specifically, it evaluated the alignment between GPT-4 and human raters, assessed GPT-4's consistency over time, and determined its error rates in grading pharmacy students' exam responses.
Methods: We conducted a comparative study using past exam responses graded by university-trained raters and by GPT-4. Responses were randomized before evaluation by GPT-4, which was accessed via a Plus account between April and September 2024. Prompt optimization was performed on 16 responses, followed by evaluation of 3 prompt delivery methods. We then applied the optimized approach across 4 item types. Intraclass correlation coefficients and error analyses were used to assess consistency and agreement between GPT-4 and human ratings.
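As an illustration of the agreement analysis described above, below is a minimal sketch (not the authors' code) of how an intraclass correlation coefficient between GPT-4 and human ratings could be computed in Python with the pingouin library. The data frame, column names, and score values are hypothetical.

```python
import pandas as pd
import pingouin as pg

# Hypothetical scores: each of 5 exam responses graded once by a
# human rater and once by GPT-4.
scores = pd.DataFrame({
    "response_id": [1, 2, 3, 4, 5] * 2,
    "rater": ["human"] * 5 + ["gpt4"] * 5,
    "score": [8, 6, 9, 4, 7,    # human marks
              7, 6, 9, 5, 7],   # GPT-4 marks
})

# Compute all ICC variants; ICC2 (two-way random effects, absolute
# agreement, single rater) is a common choice when each response is
# scored by the same set of raters.
icc = pg.intraclass_corr(
    data=scores, targets="response_id", raters="rater", ratings="score"
)
print(icc[["Type", "ICC", "CI95%"]])
```

Under commonly used interpretation guidelines (e.g., Koo and Li), ICC values of 0.5 to 0.75 indicate moderate reliability, 0.75 to 0.9 good reliability, and above 0.9 excellent reliability, which is consistent with how the reported range of 0.617 to 0.933 is characterized in the Results.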
Results: GPT-4's ratings aligned reasonably well with those of human raters, demonstrating moderate to excellent reliability (intraclass correlation coefficient = 0.617 to 0.933), depending on item type and the optimized prompt. When stratified by grade band, GPT-4 was less consistent in marking high-scoring responses (Z = -5.71 to 4.62, P<0.001). Overall, despite achieving substantial alignment with human raters in many cases, discrepancies across item types and a tendency to commit basic errors necessitate continued educator involvement to ensure grading accuracy.
Conclusion: With optimized prompts, GPT-4 shows promise as a supportive tool for grading pharmacy students' exam responses, particularly for objective tasks. However, its limitations, including errors and variability in grading high-scoring responses, require ongoing human oversight. Future research should explore advanced generative artificial intelligence models and broader assessment formats to further enhance grading reliability.
About the journal
Journal of Educational Evaluation for Health Professions aims to provide readers with state-of-the-art, practical information on educational evaluation for the health professions, with the goal of improving the quality of undergraduate, graduate, and continuing education. It specializes in educational evaluation, including the application of measurement theory to health professions education, the promotion of high-stakes examinations such as national licensing examinations, the improvement of nationwide and international education programs, computer-based testing, computerized adaptive testing, and the work of health regulatory bodies. Its scope covers a variety of professions concerned with public health, including but not limited to: care workers, dental hygienists, dental technicians, dentists, dietitians, emergency medical technicians, health educators, medical record technicians, medical technologists, midwives, nurses, nursing aides, occupational therapists, opticians, Oriental medical doctors, Oriental medicine dispensers, Oriental pharmacists, pharmacists, physical therapists, physicians, prosthetists and orthotists, radiological technologists, rehabilitation counselors, sanitary technicians, and speech-language therapists.