{"title":"全面分析 2012-2023 年美国泌尿协会自我评估学习计划考试中 GPT-3.5 和 GPT-4 的成绩。","authors":"Ali Sherazi, David Canes","doi":"10.5489/cuaj.8526","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>Artificial intelligence (AI) applications, specifically generative pre-trained transformers, have shown potential in medical education and board-style examinations. To assess this capability, we conducted a study comparing the performance of GPT-3.5 and GPT-4 on the American Urological Association (AUA) 2022 self-assessment study program (SASP) exams from 2012-2023.</p><p><strong>Methods: </strong>We used a standardized prompt to administer questions from the AUA SASP exams spanning 2012-2023, totalling 1679 questions. The performance of the two AI models, GPT-3.5 and GPT-4, was evaluated based on the number of questions answered correctly. Statistical analysis was performed using Fisher's exact test and independent sample t-tests to compare the performance of GPT-4 to that of GPT-3.5 among test years and urology topic areas. Percentile scores were not calculable, however, a score of 50% is required to acquire CME credits on AUA SASP exams.</p><p><strong>Results: </strong>The analysis showed significantly superior performance by GPT-4, which scored above 50% across all exam years except 2018, with scores ranging from 48-64%. In contrast, GPT-3.5 consistently scored below this threshold, with scores ranging from 26-38%. The total combined score for GPT-4 was 55%, significantly higher than the 33% achieved by GPT-3.5 (odds ratio [OR] 2.5, 95% confidence interval [CI] 2.2-2.9, p<0.001). 
GPT-4 significantly outperformed GPT-3.5 among AUA SASP test years from 2012-2023 (mean difference 23, t(22) 14, 95% CI 19-26, p<0.001), as well as among urology topic areas (mean difference 21, t(52)=5.5, 95% CI 13-29, p<0.001).</p><p><strong>Conclusions: </strong>GPT-4 scored significantly higher than GPT-3.5 on the AUA SASP exams in overall performance, across all test years, and in various urology topic areas. This suggests improvement in evolving AI language models in answering clinical urology questions; however, certain aspects of medical knowledge and clinical reasoning remain challenging for AI language models.</p>","PeriodicalId":50613,"journal":{"name":"Cuaj-Canadian Urological Association Journal","volume":null,"pages":null},"PeriodicalIF":1.9000,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Comprehensive analysis of the performance of GPT-3.5 and GPT-4 on the American Urological Association self-assessment study program exams from 2012-2023.\",\"authors\":\"Ali Sherazi, David Canes\",\"doi\":\"10.5489/cuaj.8526\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Introduction: </strong>Artificial intelligence (AI) applications, specifically generative pre-trained transformers, have shown potential in medical education and board-style examinations. To assess this capability, we conducted a study comparing the performance of GPT-3.5 and GPT-4 on the American Urological Association (AUA) 2022 self-assessment study program (SASP) exams from 2012-2023.</p><p><strong>Methods: </strong>We used a standardized prompt to administer questions from the AUA SASP exams spanning 2012-2023, totalling 1679 questions. The performance of the two AI models, GPT-3.5 and GPT-4, was evaluated based on the number of questions answered correctly. 
Statistical analysis was performed using Fisher's exact test and independent sample t-tests to compare the performance of GPT-4 to that of GPT-3.5 among test years and urology topic areas. Percentile scores were not calculable, however, a score of 50% is required to acquire CME credits on AUA SASP exams.</p><p><strong>Results: </strong>The analysis showed significantly superior performance by GPT-4, which scored above 50% across all exam years except 2018, with scores ranging from 48-64%. In contrast, GPT-3.5 consistently scored below this threshold, with scores ranging from 26-38%. The total combined score for GPT-4 was 55%, significantly higher than the 33% achieved by GPT-3.5 (odds ratio [OR] 2.5, 95% confidence interval [CI] 2.2-2.9, p<0.001). GPT-4 significantly outperformed GPT-3.5 among AUA SASP test years from 2012-2023 (mean difference 23, t(22) 14, 95% CI 19-26, p<0.001), as well as among urology topic areas (mean difference 21, t(52)=5.5, 95% CI 13-29, p<0.001).</p><p><strong>Conclusions: </strong>GPT-4 scored significantly higher than GPT-3.5 on the AUA SASP exams in overall performance, across all test years, and in various urology topic areas. 
This suggests improvement in evolving AI language models in answering clinical urology questions; however, certain aspects of medical knowledge and clinical reasoning remain challenging for AI language models.</p>\",\"PeriodicalId\":50613,\"journal\":{\"name\":\"Cuaj-Canadian Urological Association Journal\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.9000,\"publicationDate\":\"2023-12-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Cuaj-Canadian Urological Association Journal\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.5489/cuaj.8526\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"UROLOGY & NEPHROLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cuaj-Canadian Urological Association Journal","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.5489/cuaj.8526","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"UROLOGY & NEPHROLOGY","Score":null,"Total":0}
Comprehensive analysis of the performance of GPT-3.5 and GPT-4 on the American Urological Association self-assessment study program exams from 2012-2023.
Introduction: Artificial intelligence (AI) applications, specifically generative pre-trained transformers, have shown potential in medical education and board-style examinations. To assess this capability, we conducted a study comparing the performance of GPT-3.5 and GPT-4 on the American Urological Association (AUA) self-assessment study program (SASP) exams from 2012-2023.
Methods: We used a standardized prompt to administer questions from the AUA SASP exams spanning 2012-2023, totalling 1679 questions. The performance of the two AI models, GPT-3.5 and GPT-4, was evaluated based on the number of questions answered correctly. Statistical analysis was performed using Fisher's exact test and independent-sample t-tests to compare the performance of GPT-4 to that of GPT-3.5 across test years and urology topic areas. Percentile scores were not calculable; however, a score of 50% is required to acquire CME credits on AUA SASP exams.
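The year-level comparison described above can be illustrated with a minimal pooled-variance two-sample t-test. The yearly percentage scores below are hypothetical placeholders, not the study's data; with 12 exam years per model, the degrees of freedom are 12 + 12 - 2 = 22, matching the reported t(22).

```python
import math

def independent_t(x, y):
    """Pooled-variance two-sample t-test: returns (t statistic, df, mean difference)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)  # sample variance of x
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)  # sample variance of y
    sp2 = ((nx - 1) * vx + (ny - 1) * vy) / (nx + ny - 2)  # pooled variance
    se = math.sqrt(sp2 * (1 / nx + 1 / ny))
    return (mx - my) / se, nx + ny - 2, mx - my

# Hypothetical yearly scores (%) for the 12 exam years, 2012-2023
gpt4_yearly  = [56, 58, 54, 60, 64, 52, 48, 57, 55, 53, 59, 61]
gpt35_yearly = [33, 35, 30, 36, 38, 31, 26, 34, 32, 29, 35, 37]

t_stat, df, diff = independent_t(gpt4_yearly, gpt35_yearly)
print(f"t({df}) = {t_stat:.1f}, mean difference {diff:.1f}")
```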
Results: The analysis showed significantly superior performance by GPT-4, which scored above 50% across all exam years except 2018, with scores ranging from 48-64%. In contrast, GPT-3.5 consistently scored below this threshold, with scores ranging from 26-38%. The total combined score for GPT-4 was 55%, significantly higher than the 33% achieved by GPT-3.5 (odds ratio [OR] 2.5, 95% confidence interval [CI] 2.2-2.9, p<0.001). GPT-4 significantly outperformed GPT-3.5 across AUA SASP test years from 2012-2023 (mean difference 23, t(22)=14, 95% CI 19-26, p<0.001), as well as across urology topic areas (mean difference 21, t(52)=5.5, 95% CI 13-29, p<0.001).
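The reported odds ratio and its confidence interval can be reproduced from the overall scores. The sketch below (not the authors' code) derives approximate correct/incorrect cell counts from 55% and 33% of 1679 questions and applies the standard log-odds normal approximation for the 95% CI.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """2x2 table: a/b = GPT-4 correct/incorrect, c/d = GPT-3.5 correct/incorrect.
    Returns (odds ratio, CI lower bound, CI upper bound) using the
    log-odds normal approximation: SE = sqrt(1/a + 1/b + 1/c + 1/d)."""
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo = math.exp(math.log(or_) - z * se)
    hi = math.exp(math.log(or_) + z * se)
    return or_, lo, hi

gpt4_correct, gpt4_wrong = 923, 756     # ~55% of 1679 questions
gpt35_correct, gpt35_wrong = 554, 1125  # ~33% of 1679 questions

or_, lo, hi = odds_ratio_ci(gpt4_correct, gpt4_wrong, gpt35_correct, gpt35_wrong)
print(f"OR {or_:.1f}, 95% CI {lo:.1f}-{hi:.1f}")  # → OR 2.5, 95% CI 2.2-2.9
```

The result matches the abstract's reported OR 2.5 (95% CI 2.2-2.9), confirming the cell counts are consistent with the stated percentages.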
Conclusions: GPT-4 scored significantly higher than GPT-3.5 on the AUA SASP exams in overall performance, across all test years, and in various urology topic areas. This suggests that evolving AI language models are improving at answering clinical urology questions; however, certain aspects of medical knowledge and clinical reasoning remain challenging for AI language models.
Journal description:
CUAJ is a peer-reviewed, open-access journal devoted to promoting the highest standard of urological patient care through the publication of timely, relevant, evidence-based research and advocacy information.