Can ChatGPT pass the urology fellowship examination? Artificial intelligence capability in surgical training assessment

Kathleen Lockhart, Ashan Canagasingham, Wenjie Zhong, Darius Ashrafi, Brayden March, Dane Cole‐Clark, Alice Grant, Amanda Chung

BJU International · Published 2025-06-20 · DOI: 10.1111/bju.16806
Citations: 0
Abstract
Objectives
To assess the performance of ChatGPT compared with human trainees in the Australian Urology written fellowship examination (essay format).

Materials and Methods
Each examination was marked independently by two blinded examining urologists and assessed for overall pass/fail outcome, proportion of questions passed, and adjusted aggregate score. The examining urologists also made a blinded judgement as to authorship (artificial intelligence [AI] or trainee).

Results
A total of 20 examination papers were marked: 10 completed by urology trainees and 10 by AI platforms (half each on ChatGPT‐3.5 and ChatGPT‐4.0). Overall, 9/10 trainees passed the urology fellowship examination, whereas only 6/10 ChatGPT examinations passed (P = 0.3). Of the failing ChatGPT examinations, 3/4 were produced by the ChatGPT‐3.5 platform. The proportion of questions passed per examination was higher for trainees than for ChatGPT: mean 89.4% vs 80.9% (P = 0.2). The adjusted aggregate scores of trainees were also higher than those of ChatGPT, by a small margin: mean 79.2% vs 78.1% (P = 0.8). ChatGPT‐3.5 and ChatGPT‐4.0 achieved similar aggregate scores (78.9% and 77.4%, P = 0.8), although ChatGPT‐3.5 had a lower percentage of questions passed per examination: mean 79.6% vs 82.1% (P = 0.8). Two examinations were incorrectly attributed by the examining urologists (both trainee candidates perceived to be ChatGPT); the sensitivity for identifying ChatGPT authorship was therefore 100% and overall accuracy was 91.7%.

Conclusion
Overall, ChatGPT did not perform as well as human trainees in the Australian Urology fellowship written examination, and examiners were able to identify AI‐generated answers with a high degree of accuracy.
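As a rough illustration of how the reported detection figures fit together, the Python sketch below derives sensitivity, specificity, and accuracy from a confusion matrix of examiner authorship judgements. The individual counts are assumptions chosen to be consistent with the reported values (100% sensitivity, 91.7% accuracy, two trainee papers misattributed to ChatGPT); the paper's abstract does not state the exact number of blinded judgements.

```python
# Hypothetical worked example of the authorship-detection metrics.
# Counts are illustrative assumptions, not figures reported in the paper.

true_positives = 12   # ChatGPT papers correctly judged as ChatGPT (assumed)
false_negatives = 0   # ChatGPT papers judged as trainee work (assumed)
true_negatives = 10   # trainee papers correctly judged as trainee work (assumed)
false_positives = 2   # trainee papers judged as ChatGPT (two misattributions reported)

sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)
accuracy = (true_positives + true_negatives) / (
    true_positives + false_negatives + true_negatives + false_positives
)

print(f"Sensitivity for identifying ChatGPT authorship: {sensitivity:.1%}")  # 100.0%
print(f"Specificity: {specificity:.1%}")                                     # 83.3%
print(f"Overall accuracy: {accuracy:.1%}")                                   # 91.7%
```

Under these assumed counts (24 judgements in total), 22/24 correct attributions reproduce the reported 91.7% overall accuracy, while zero missed ChatGPT papers give the reported 100% sensitivity.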
About the journal:
BJUI is one of the most highly respected medical journals in the world, with a truly international range of published papers and broad appeal. Every issue provides invaluable practical information in the form of original articles, reviews, comments, surgical education articles, and translational science articles in the field of urology. BJUI uses topical sections and is published in full colour, making it easier to browse or to search for something specific.