Comprehensive analysis of the performance of GPT-3.5 and GPT-4 on the American Urological Association self-assessment study program exams from 2012-2023.

IF 1.9 | Region 4 (Medicine) | Q3 UROLOGY & NEPHROLOGY
Ali Sherazi, David Canes
{"title":"全面分析 2012-2023 年美国泌尿协会自我评估学习计划考试中 GPT-3.5 和 GPT-4 的成绩。","authors":"Ali Sherazi, David Canes","doi":"10.5489/cuaj.8526","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>Artificial intelligence (AI) applications, specifically generative pre-trained transformers, have shown potential in medical education and board-style examinations. To assess this capability, we conducted a study comparing the performance of GPT-3.5 and GPT-4 on the American Urological Association (AUA) 2022 self-assessment study program (SASP) exams from 2012-2023.</p><p><strong>Methods: </strong>We used a standardized prompt to administer questions from the AUA SASP exams spanning 2012-2023, totalling 1679 questions. The performance of the two AI models, GPT-3.5 and GPT-4, was evaluated based on the number of questions answered correctly. Statistical analysis was performed using Fisher's exact test and independent sample t-tests to compare the performance of GPT-4 to that of GPT-3.5 among test years and urology topic areas. Percentile scores were not calculable, however, a score of 50% is required to acquire CME credits on AUA SASP exams.</p><p><strong>Results: </strong>The analysis showed significantly superior performance by GPT-4, which scored above 50% across all exam years except 2018, with scores ranging from 48-64%. In contrast, GPT-3.5 consistently scored below this threshold, with scores ranging from 26-38%. The total combined score for GPT-4 was 55%, significantly higher than the 33% achieved by GPT-3.5 (odds ratio [OR] 2.5, 95% confidence interval [CI] 2.2-2.9, p<0.001). GPT-4 significantly outperformed GPT-3.5 among AUA SASP test years from 2012-2023 (mean difference 23, t(22) 14, 95% CI 19-26, p<0.001), as well as among urology topic areas (mean difference 21, t(52)=5.5, 95% CI 13-29, p<0.001).</p><p><strong>Conclusions: </strong>GPT-4 scored significantly higher than GPT-3.5 on the AUA SASP exams in overall performance, across all test years, and in various urology topic areas. This suggests improvement in evolving AI language models in answering clinical urology questions; however, certain aspects of medical knowledge and clinical reasoning remain challenging for AI language models.</p>","PeriodicalId":50613,"journal":{"name":"Cuaj-Canadian Urological Association Journal","volume":null,"pages":null},"PeriodicalIF":1.9000,"publicationDate":"2023-12-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Comprehensive analysis of the performance of GPT-3.5 and GPT-4 on the American Urological Association self-assessment study program exams from 2012-2023.\",\"authors\":\"Ali Sherazi, David Canes\",\"doi\":\"10.5489/cuaj.8526\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Introduction: </strong>Artificial intelligence (AI) applications, specifically generative pre-trained transformers, have shown potential in medical education and board-style examinations. To assess this capability, we conducted a study comparing the performance of GPT-3.5 and GPT-4 on the American Urological Association (AUA) 2022 self-assessment study program (SASP) exams from 2012-2023.</p><p><strong>Methods: </strong>We used a standardized prompt to administer questions from the AUA SASP exams spanning 2012-2023, totalling 1679 questions. The performance of the two AI models, GPT-3.5 and GPT-4, was evaluated based on the number of questions answered correctly. 
Statistical analysis was performed using Fisher's exact test and independent sample t-tests to compare the performance of GPT-4 to that of GPT-3.5 among test years and urology topic areas. Percentile scores were not calculable, however, a score of 50% is required to acquire CME credits on AUA SASP exams.</p><p><strong>Results: </strong>The analysis showed significantly superior performance by GPT-4, which scored above 50% across all exam years except 2018, with scores ranging from 48-64%. In contrast, GPT-3.5 consistently scored below this threshold, with scores ranging from 26-38%. The total combined score for GPT-4 was 55%, significantly higher than the 33% achieved by GPT-3.5 (odds ratio [OR] 2.5, 95% confidence interval [CI] 2.2-2.9, p<0.001). GPT-4 significantly outperformed GPT-3.5 among AUA SASP test years from 2012-2023 (mean difference 23, t(22) 14, 95% CI 19-26, p<0.001), as well as among urology topic areas (mean difference 21, t(52)=5.5, 95% CI 13-29, p<0.001).</p><p><strong>Conclusions: </strong>GPT-4 scored significantly higher than GPT-3.5 on the AUA SASP exams in overall performance, across all test years, and in various urology topic areas. This suggests improvement in evolving AI language models in answering clinical urology questions; however, certain aspects of medical knowledge and clinical reasoning remain challenging for AI language models.</p>\",\"PeriodicalId\":50613,\"journal\":{\"name\":\"Cuaj-Canadian Urological Association Journal\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":1.9000,\"publicationDate\":\"2023-12-21\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Cuaj-Canadian Urological Association Journal\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.5489/cuaj.8526\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"UROLOGY & NEPHROLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Cuaj-Canadian Urological Association Journal","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.5489/cuaj.8526","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"UROLOGY & NEPHROLOGY","Score":null,"Total":0}
Citations: 0

Abstract


Introduction: Artificial intelligence (AI) applications, specifically generative pre-trained transformers, have shown potential in medical education and board-style examinations. To assess this capability, we conducted a study comparing the performance of GPT-3.5 and GPT-4 on the American Urological Association (AUA) self-assessment study program (SASP) exams from 2012-2023.

Methods: We used a standardized prompt to administer questions from the AUA SASP exams spanning 2012-2023, totalling 1679 questions. The performance of the two AI models, GPT-3.5 and GPT-4, was evaluated based on the number of questions answered correctly. Statistical analysis was performed using Fisher's exact test and independent-samples t-tests to compare the performance of GPT-4 to that of GPT-3.5 across test years and urology topic areas. Percentile scores were not calculable; however, a score of 50% is required to earn CME credits on AUA SASP exams.
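
The aggregate model-vs.-model comparison described above reduces to a 2x2 contingency table (model x correct/incorrect), which is what Fisher's exact test operates on. The sketch below is a minimal illustration rather than the authors' actual code: the counts are reconstructed from the reported overall scores (55% and 33% of 1679 questions), not taken from the raw item-level data.

```python
# Minimal sketch: Fisher's exact test on a 2x2 table of correct/incorrect
# counts. Counts are reconstructed from the reported overall scores
# (55% and 33% of 1679 questions), not the authors' raw data.
from scipy.stats import fisher_exact

TOTAL = 1679
gpt4_correct = round(0.55 * TOTAL)   # ~923
gpt35_correct = round(0.33 * TOTAL)  # ~554

table = [
    [gpt4_correct, TOTAL - gpt4_correct],    # GPT-4:   correct, incorrect
    [gpt35_correct, TOTAL - gpt35_correct],  # GPT-3.5: correct, incorrect
]
odds_ratio, p_value = fisher_exact(table)
print(f"OR = {odds_ratio:.2f}, p = {p_value:.2g}")  # OR ~2.5, p < 0.001

# The per-year and per-topic comparisons would use an independent-samples
# t-test on vectors of percentage scores (not available here), e.g.:
#   scipy.stats.ttest_ind(gpt4_yearly_scores, gpt35_yearly_scores)
```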

Results: The analysis showed significantly superior performance by GPT-4, which scored above 50% across all exam years except 2018, with scores ranging from 48% to 64%. In contrast, GPT-3.5 consistently scored below this threshold, with scores ranging from 26% to 38%. The total combined score for GPT-4 was 55%, significantly higher than the 33% achieved by GPT-3.5 (odds ratio [OR] 2.5, 95% confidence interval [CI] 2.2-2.9, p<0.001). GPT-4 significantly outperformed GPT-3.5 across AUA SASP test years from 2012-2023 (mean difference 23, t(22)=14, 95% CI 19-26, p<0.001), as well as across urology topic areas (mean difference 21, t(52)=5.5, 95% CI 13-29, p<0.001).
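
As an arithmetic check (again an illustration, not the authors' computation), the reported OR of 2.5 with 95% CI 2.2-2.9 can be recovered from the reconstructed counts using a standard Wald interval on the log odds ratio:

```python
# Back-calculate the odds ratio and a 95% Wald CI from the reconstructed
# aggregate counts (55% vs. 33% correct out of 1679 questions each).
import math

a, b = 923, 1679 - 923   # GPT-4:   correct, incorrect (reconstructed)
c, d = 554, 1679 - 554   # GPT-3.5: correct, incorrect (reconstructed)

or_hat = (a * d) / (b * c)                  # sample odds ratio: ~2.48
se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)   # standard error of log(OR)
lo = math.exp(math.log(or_hat) - 1.96 * se_log)
hi = math.exp(math.log(or_hat) + 1.96 * se_log)
print(f"OR = {or_hat:.2f}, 95% CI {lo:.1f}-{hi:.1f}")
# -> OR = 2.48, 95% CI 2.2-2.9, matching the reported values
```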

Conclusions: GPT-4 scored significantly higher than GPT-3.5 on the AUA SASP exams in overall performance, across all test years, and across urology topic areas. This suggests that evolving AI language models are improving at answering clinical urology questions; however, certain aspects of medical knowledge and clinical reasoning remain challenging for AI language models.

Source journal
Cuaj-Canadian Urological Association Journal (Medicine: Urology & Nephrology)
CiteScore: 2.80
Self-citation rate: 10.50%
Articles per year: 167
Review turnaround: >12 weeks
About the journal: CUAJ is a peer-reviewed, open-access journal devoted to promoting the highest standard of urological patient care through the publication of timely, relevant, evidence-based research and advocacy information.