Can ChatGPT pass the urology fellowship examination? Artificial intelligence capability in surgical training assessment

IF 3.7 | Medicine, Region 2 | Q1 UROLOGY & NEPHROLOGY
Kathleen Lockhart, Ashan Canagasingham, Wenjie Zhong, Darius Ashrafi, Brayden March, Dane Cole‐Clark, Alice Grant, Amanda Chung
{"title":"Can ChatGPT pass the urology fellowship examination? Artificial intelligence capability in surgical training assessment","authors":"Kathleen Lockhart, Ashan Canagasingham, Wenjie Zhong, Darius Ashrafi, Brayden March, Dane Cole‐Clark, Alice Grant, Amanda Chung","doi":"10.1111/bju.16806","DOIUrl":null,"url":null,"abstract":"ObjectivesTo assess the performance of ChatGPT compared to human trainees in the Australian Urology written fellowship examination (essay format).Materials and MethodsEach examination was marked independently by two blinded examining urologists and assessed for: overall pass/failure; proportion of passing questions; and adjusted aggregate score. Examining urologists also made a blinded judgement as to authorship (artificial intelligence [AI] or trainee).ResultsA total of 20 examination papers were marked; 10 completed by urology trainees and 10 by AI platforms (half each on ChatGPT‐3.5 and ‐4.0). Overall, 9/10 of trainees successfully passed the urology fellowship, whereas only 6/10 of ChatGPT examinations passed (<jats:italic>P</jats:italic> = 0.3). Of the ChatGPT failing examinations, 3/4 were undertaken by the ChatGPT‐3.5 platform. The proportion of passing questions per examination was higher in trainees compared to ChatGPT: mean 89.4% vs 80.9% (<jats:italic>P</jats:italic> = 0.2). The adjusted aggregate scores of trainees were also higher than those of ChatGPT by a small margin: mean 79.2% vs 78.1% (<jats:italic>P</jats:italic> = 0.8). ChatGPT‐3.5 and ChatGPT‐4.0 achieved similar aggregate scores (78.9% and 77.4%, <jats:italic>P</jats:italic> = 0.8). However, ChatGPT‐3.5 had a lower percentage of passing questions per examination: mean 79.6% vs 82.1% (<jats:italic>P</jats:italic> = 0.8). Two examinations were incorrectly assigned by examining urologists (both trainee candidates perceived to be ChatGPT); therefore, the sensitivity for identifying ChatGPT authorship was 100% and overall accuracy was 91.7%.ConclusionOverall, ChatGPT did not perform as well as human trainees in the Australian Urology fellowship written examination. Examiners were able to identify AI‐generated answers with a high degree of accuracy.","PeriodicalId":8985,"journal":{"name":"BJU International","volume":"44 1","pages":""},"PeriodicalIF":3.7000,"publicationDate":"2025-06-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BJU International","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1111/bju.16806","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"UROLOGY & NEPHROLOGY","Score":null,"Total":0}
Citations: 0

Abstract

Objectives: To assess the performance of ChatGPT compared to human trainees in the Australian Urology written fellowship examination (essay format).

Materials and Methods: Each examination was marked independently by two blinded examining urologists and assessed for: overall pass/fail; proportion of passing questions; and adjusted aggregate score. Examining urologists also made a blinded judgement as to authorship (artificial intelligence [AI] or trainee).

Results: A total of 20 examination papers were marked; 10 completed by urology trainees and 10 by AI platforms (half each on ChatGPT-3.5 and -4.0). Overall, 9/10 trainees successfully passed the urology fellowship, whereas only 6/10 ChatGPT examinations passed (P = 0.3). Of the failing ChatGPT examinations, 3/4 were undertaken on the ChatGPT-3.5 platform. The proportion of passing questions per examination was higher for trainees than for ChatGPT: mean 89.4% vs 80.9% (P = 0.2). The adjusted aggregate scores of trainees were also higher than those of ChatGPT by a small margin: mean 79.2% vs 78.1% (P = 0.8). ChatGPT-3.5 and ChatGPT-4.0 achieved similar aggregate scores (78.9% and 77.4%, P = 0.8), although ChatGPT-3.5 had a lower percentage of passing questions per examination: mean 79.6% vs 82.1% (P = 0.8). Two examinations were incorrectly attributed by the examining urologists (both trainee candidates perceived to be ChatGPT); therefore, the sensitivity for identifying ChatGPT authorship was 100% and overall accuracy was 91.7%.

Conclusion: Overall, ChatGPT did not perform as well as human trainees in the Australian Urology fellowship written examination. Examiners were able to identify AI-generated answers with a high degree of accuracy.
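To make the headline figures concrete, the short Python sketch below (illustrative only, not code from the study) re-derives the pass-rate comparison from the counts given in the abstract using a two-sided Fisher's exact test, which is one plausible choice for comparing 9/10 vs 6/10 and reproduces the reported P ≈ 0.3, and shows how the 100% detection sensitivity follows from every ChatGPT paper being correctly flagged. The denominator behind the 91.7% overall accuracy is not stated in the abstract, so only the formula is noted.

```python
from math import comb

def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test P value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    def p_table(x: int) -> float:
        # Hypergeometric probability of a table with cell (1,1) equal to x
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)
    p_obs = p_table(a)
    lo, hi = max(0, col1 - row2), min(col1, row1)
    # Two-sided: sum probabilities of all tables at least as extreme as observed
    return sum(p_table(x) for x in range(lo, hi + 1) if p_table(x) <= p_obs + 1e-12)

# Pass rates from the abstract: trainees 9/10 vs ChatGPT 6/10
print(f"P = {fisher_exact_two_sided(9, 1, 6, 4):.2f}")  # ~0.30, matching the reported P = 0.3

# Authorship detection: every AI paper was flagged as AI (no misses), so
# sensitivity = TP / (TP + FN) = 10 / 10 = 100%. Overall accuracy is
# correct judgements / total judgements; the total behind the reported
# 91.7% is not given in the abstract, so it is not recomputed here.
tp, fn = 10, 0
print(f"sensitivity = {tp / (tp + fn):.0%}")
```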
Source journal

BJU International (Medicine – Urology & Nephrology)
CiteScore: 9.10
Self-citation rate: 4.40%
Articles per year: 262
Review time: 1 month

Journal description: BJUI is one of the most highly respected medical journals in the world, with a truly international range of published papers and appeal. Every issue gives invaluable practical information in the form of original articles, reviews, comments, surgical education articles, and translational science articles in the field of urology. BJUI employs topical sections, and is in full colour, making it easier to browse or search for something specific.