Can ChatGPT pass the urology fellowship examination? Artificial intelligence capability in surgical training assessment

Kathleen Lockhart, Ashan Canagasingham, Wenjie Zhong, Darius Ashrafi, Brayden March, Dane Cole‐Clark, Alice Grant, Amanda Chung

BJU International · Published 2025-06-20 · DOI: 10.1111/bju.16806
Citations: 0
Abstract
Objectives
To assess the performance of ChatGPT compared with human trainees in the Australian Urology written fellowship examination (essay format).

Materials and Methods
Each examination was marked independently by two blinded examining urologists and assessed for overall pass/fail outcome, proportion of questions passed, and adjusted aggregate score. The examining urologists also made a blinded judgement as to authorship (artificial intelligence [AI] or trainee).

Results
A total of 20 examination papers were marked: 10 completed by urology trainees and 10 by AI platforms (half each on ChatGPT‐3.5 and ChatGPT‐4.0). Overall, 9/10 trainees passed the urology fellowship examination, whereas only 6/10 ChatGPT examinations passed (P = 0.3). Of the failing ChatGPT examinations, 3/4 were produced by the ChatGPT‐3.5 platform. The proportion of questions passed per examination was higher for trainees than for ChatGPT: mean 89.4% vs 80.9% (P = 0.2). The adjusted aggregate scores of trainees were also higher than those of ChatGPT, by a small margin: mean 79.2% vs 78.1% (P = 0.8). ChatGPT‐3.5 and ChatGPT‐4.0 achieved similar aggregate scores (78.9% and 77.4%, P = 0.8), although ChatGPT‐3.5 had a lower percentage of questions passed per examination: mean 79.6% vs 82.1% (P = 0.8). Two examinations were incorrectly attributed by the examining urologists (both trainee candidates perceived to be ChatGPT); the sensitivity for identifying ChatGPT authorship was therefore 100% and overall accuracy was 91.7%.

Conclusion
Overall, ChatGPT did not perform as well as human trainees in the Australian Urology fellowship written examination, and examiners were able to identify AI‐generated answers with a high degree of accuracy.
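As a rough illustration of how the reported detection figures fit together, the Python sketch below derives sensitivity, specificity, and accuracy from a confusion matrix of examiner authorship judgements. The individual counts are assumptions chosen to be consistent with the reported values (100% sensitivity, 91.7% accuracy, two trainee papers misattributed to ChatGPT); the paper's abstract does not state the exact number of blinded judgements.

```python
# Hypothetical worked example of the authorship-detection metrics.
# Counts are illustrative assumptions, not figures reported in the paper.

true_positives = 12   # ChatGPT papers correctly judged as ChatGPT (assumed)
false_negatives = 0   # ChatGPT papers judged as trainee work (assumed)
true_negatives = 10   # trainee papers correctly judged as trainee work (assumed)
false_positives = 2   # trainee papers judged as ChatGPT (two misattributions reported)

sensitivity = true_positives / (true_positives + false_negatives)
specificity = true_negatives / (true_negatives + false_positives)
accuracy = (true_positives + true_negatives) / (
    true_positives + false_negatives + true_negatives + false_positives
)

print(f"Sensitivity for identifying ChatGPT authorship: {sensitivity:.1%}")  # 100.0%
print(f"Specificity: {specificity:.1%}")                                     # 83.3%
print(f"Overall accuracy: {accuracy:.1%}")                                   # 91.7%
```

Under these assumed counts (24 judgements in total), 22/24 correct attributions reproduce the reported 91.7% overall accuracy, while zero missed ChatGPT papers give the reported 100% sensitivity.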
About the journal:
BJUI is one of the most highly respected medical journals in the world, with a truly international range of published papers and broad appeal. Every issue provides invaluable practical information in the form of original articles, reviews, comments, surgical education articles, and translational science articles in the field of urology. BJUI uses topical sections and is published in full colour, making it easier to browse or to search for something specific.