Suitability of GPT-4o as an evaluator of cardiopulmonary resuscitation skills examinations
Lu Wang, Yuqiang Mao, Lin Wang, Yujie Sun, Jiangdian Song, Yang Zhang
Resuscitation, volume 204, Article 110404. Published 2024-11-01. DOI: 10.1016/j.resuscitation.2024.110404
https://www.sciencedirect.com/science/article/pii/S0300957224002983
Abstract
Aim
To assess the accuracy and reliability of GPT-4o for scoring examinees’ performance on cardiopulmonary resuscitation (CPR) skills tests.
Methods
This study included six experts certified to supervise the national medical licensing examination (three junior and three senior), who reviewed the CPR skills test videos of 103 examinees. All videos reviewed by the experts were also subjected to automated assessment by GPT-4o. Both the experts and GPT-4o scored the videos across four sections: patient assessment, chest compressions, rescue breathing, and repeated operations. The experts subsequently rated GPT-4o’s reliability on a 5-point Likert scale (1, completely unreliable; 5, completely reliable). GPT-4o’s accuracy was evaluated as the agreement between its scores and those of the experts, using the intraclass correlation coefficient for the first three sections and Fleiss’ Kappa for the last section.
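As an illustration of the agreement statistic named above for the last section, the following is a minimal sketch of Fleiss' Kappa in plain Python. It is not the authors' analysis code: the function name `fleiss_kappa`, the input layout (one list of category labels per video, one label per rater), and the pass/fail labels are all assumptions made for this example.

```python
from collections import Counter


def fleiss_kappa(ratings):
    """Fleiss' Kappa for a list of subjects, each rated by the same number of raters.

    `ratings` is a list of lists: one inner list per subject (hypothetically,
    one per video), holding one categorical label per rater.
    """
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2:
        raise ValueError("at least two raters are required")
    if any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject must have the same number of ratings")

    categories = sorted({label for row in ratings for label in row})
    counts = [Counter(row) for row in ratings]

    # Observed agreement per subject: share of rater pairs that agree.
    per_subject = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_observed = sum(per_subject) / n_subjects

    # Chance agreement from the overall proportion of each category.
    total = n_subjects * n_raters
    p_expected = sum(
        (sum(c[cat] for c in counts) / total) ** 2 for cat in categories
    )

    if p_expected == 1.0:
        # Only one category was ever used, so kappa is undefined.
        return float("nan")
    return (p_observed - p_expected) / (1.0 - p_expected)


# Hypothetical labels: four videos, three raters each, in full agreement.
perfect = [["pass"] * 3, ["fail"] * 3, ["pass"] * 3, ["fail"] * 3]
kappa_perfect = fleiss_kappa(perfect)

# Hypothetical labels: two videos, two raters who always disagree.
opposed = [["pass", "fail"], ["pass", "fail"]]
kappa_opposed = fleiss_kappa(opposed)
```

Values near 1 indicate agreement well beyond chance, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement.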
Results
The mean accuracy scores for the patient assessment, chest compressions, rescue breathing, and repeated operations sections were 0.65, 0.58, 0.60, and 0.31, respectively, when GPT-4o’s scores were compared with those of the junior experts, and 0.75, 0.65, 0.72, and 0.41, respectively, when they were compared with those of the senior experts. For reliability, the median Likert scale scores were 4.00 (interquartile range [IQR] = 3.66–4.33, mean [standard deviation] = 3.95 [0.55]) and 4.33 (4.00–4.67, 4.29 [0.50]) for the junior and senior experts, respectively.
Conclusions
GPT-4o demonstrated a level of accuracy similar to that of senior experts in scoring CPR skills examination videos. The results demonstrate the potential for deploying this large language model in medical examination settings.
About the journal:
Resuscitation is a monthly international and interdisciplinary medical journal. The papers published deal with the aetiology, pathophysiology and prevention of cardiac arrest, resuscitation training, clinical resuscitation, and experimental resuscitation research, although papers relating to animal studies will be published only if they are of exceptional interest and related directly to clinical cardiopulmonary resuscitation. Papers relating to trauma are published occasionally but the majority of these concern traumatic cardiac arrest.