{"title":"How well does GPT-4 perform on an emergency medicine board exam? A comparative assessment.","authors":"Naser Almehairi, Gregory Clark, Seth Davis","doi":"10.1007/s43678-025-00951-0","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Recent advancements in artificial intelligence have shown promise in enhancing diagnostic precision within healthcare sectors. In emergency departments, artificial intelligence has demonstrated potential for improving triage, guiding the choice of radiologic imaging and crafting individualized medical notes and discharge summaries, including tailored care plans. Advances in generative artificial intelligence have led to the development of sophisticated models such as OpenAI's GPT-4. This study assessed the ability of generative artificial intelligence in diagnosis and management in emergency medicine. Specifically, we compared GPT-4 with the performance of emergency medicine trainees in Canada, as gauged by the Canadian In-Training Examination.</p><p><strong>Methods: </strong>We compared the performance of emergency medicine residents to GPT-4 on the Canadian in-training exams for the years 2021 and 2022. Each question was entered into a fresh GPT-4 chat and the first response was recorded without any prompting. GPT-4's responses were then assessed using the same marking grid that is employed for evaluating medical trainees. We then compared GPT-4'sscores to the average scores of each post-graduate year (PGY) level of residents across all FRCPC training programs. Ethical approval was obtained, then Canadian In-Training Examination committee provided exam questions and anonymized national results.</p><p><strong>Results: </strong>The participants in this study included 389 residents in 2021 and 333 residents in the 2022 exams. In 2021, mean trainee scores increased progressively across the levels, with PGY1 trainees scoring 48.0% (SD 15.6), PGY2 at 56.2% (SD 14.7), PGY3 at 59.8% (SD 16.7), PGY4 at 67.2% (12.3), and PGY5 at 70.1% (SD 12.5), whereas GPT-4 scored 88.7%. In 2022, a similar pattern, with PGY1 scoring 46.3% (SD 14.7), PGY2 at 51.8% (SD 14.7), PGY3 at 58.2% (SD 14.3), PGY4 at 66.2% (SD 15.3), and PGY5 at 64.3% (SD 8.5), while GPT-4 scored 82.0%.</p><p><strong>Conclusions: </strong>GPT-4 has shown impressive capabilities, surpassing the performance of medical trainees at different postgraduate levels in the clinical written exam. These findings highlight the potential of artificial intelligence to serve as a valuable support tool in medical practice. However, it should be used with caution and must not substitute for established, evidence-based medical resources.</p>","PeriodicalId":93937,"journal":{"name":"CJEM","volume":" ","pages":""},"PeriodicalIF":2.0000,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"CJEM","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s43678-025-00951-0","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Background: Recent advancements in artificial intelligence have shown promise in enhancing diagnostic precision across healthcare. In emergency departments, artificial intelligence has demonstrated potential for improving triage, guiding the choice of radiologic imaging, and crafting individualized medical notes and discharge summaries, including tailored care plans. Advances in generative artificial intelligence have led to the development of sophisticated models such as OpenAI's GPT-4. This study assessed the ability of generative artificial intelligence to support diagnosis and management in emergency medicine. Specifically, we compared GPT-4's performance with that of emergency medicine trainees in Canada, as measured by the Canadian In-Training Examination.
Methods: We compared the performance of emergency medicine residents to that of GPT-4 on the Canadian In-Training Examinations for 2021 and 2022. Each question was entered into a fresh GPT-4 chat, and the first response was recorded without any additional prompting. GPT-4's responses were then assessed using the same marking grid employed for evaluating medical trainees. We then compared GPT-4's scores to the average scores of residents at each post-graduate year (PGY) level across all FRCPC training programs. Ethical approval was obtained, after which the Canadian In-Training Examination committee provided the exam questions and anonymized national results.
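For readers wishing to reproduce a comparable protocol programmatically, below is a minimal sketch using the OpenAI Python SDK. The study itself entered questions into fresh GPT-4 chat sessions; the model name, helper function, and placeholder question list here are illustrative assumptions, not the authors' actual tooling.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_gpt4(question: str) -> str:
    """Send one exam question in a brand-new conversation and return the
    first response, mirroring the fresh-chat-per-question protocol."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative; the study used the GPT-4 chat interface
        messages=[{"role": "user", "content": question}],  # no system prompt, no prompt engineering
    )
    return response.choices[0].message.content

# Collect first responses for later scoring against the trainee marking grid.
exam_questions = ["<question 1 text>", "<question 2 text>"]  # placeholders
first_responses = [ask_gpt4(q) for q in exam_questions]
```

Starting a new conversation per question prevents earlier questions or answers from leaking into later responses, which matches the paper's stated design of recording the first unprompted reply.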
Results: The participants included 389 residents in the 2021 exam and 333 residents in the 2022 exam. In 2021, mean trainee scores increased progressively across the levels, with PGY1 trainees scoring 48.0% (SD 15.6), PGY2 at 56.2% (SD 14.7), PGY3 at 59.8% (SD 16.7), PGY4 at 67.2% (SD 12.3), and PGY5 at 70.1% (SD 12.5), whereas GPT-4 scored 88.7%. In 2022, a similar pattern emerged, with PGY1 scoring 46.3% (SD 14.7), PGY2 at 51.8% (SD 14.7), PGY3 at 58.2% (SD 14.3), PGY4 at 66.2% (SD 15.3), and PGY5 at 64.3% (SD 8.5), while GPT-4 scored 82.0%.
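To put the gap in context, a quick back-of-the-envelope calculation (illustrative only, not an analysis from the paper) expresses GPT-4's score in units of the senior-resident standard deviation, using only the means and SDs reported above:

```python
# GPT-4's distance from the PGY5 cohort mean, in within-cohort SDs.
# All numbers are taken directly from the Results section.
results = {
    2021: {"gpt4": 88.7, "pgy5_mean": 70.1, "pgy5_sd": 12.5},
    2022: {"gpt4": 82.0, "pgy5_mean": 64.3, "pgy5_sd": 8.5},
}

for year, r in results.items():
    z = (r["gpt4"] - r["pgy5_mean"]) / r["pgy5_sd"]
    print(f"{year}: GPT-4 scored {z:.1f} SD above the PGY5 mean")

# Output: 1.5 SD above in 2021, 2.1 SD above in 2022.
```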
Conclusions: GPT-4 demonstrated impressive capabilities, surpassing the performance of medical trainees at every postgraduate level on this clinical written examination. These findings highlight the potential of artificial intelligence to serve as a valuable support tool in medical practice. However, it should be used with caution and must not substitute for established, evidence-based medical resources.