{"title":"How well does GPT-4 perform on an emergency medicine board exam? A comparative assessment.","authors":"Naser Almehairi, Gregory Clark, Seth Davis","doi":"10.1007/s43678-025-00951-0","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Recent advancements in artificial intelligence have shown promise in enhancing diagnostic precision within healthcare sectors. In emergency departments, artificial intelligence has demonstrated potential for improving triage, guiding the choice of radiologic imaging and crafting individualized medical notes and discharge summaries, including tailored care plans. Advances in generative artificial intelligence have led to the development of sophisticated models such as OpenAI's GPT-4. This study assessed the ability of generative artificial intelligence in diagnosis and management in emergency medicine. Specifically, we compared GPT-4 with the performance of emergency medicine trainees in Canada, as gauged by the Canadian In-Training Examination.</p><p><strong>Methods: </strong>We compared the performance of emergency medicine residents to GPT-4 on the Canadian in-training exams for the years 2021 and 2022. Each question was entered into a fresh GPT-4 chat and the first response was recorded without any prompting. GPT-4's responses were then assessed using the same marking grid that is employed for evaluating medical trainees. We then compared GPT-4'sscores to the average scores of each post-graduate year (PGY) level of residents across all FRCPC training programs. Ethical approval was obtained, then Canadian In-Training Examination committee provided exam questions and anonymized national results.</p><p><strong>Results: </strong>The participants in this study included 389 residents in 2021 and 333 residents in the 2022 exams. In 2021, mean trainee scores increased progressively across the levels, with PGY1 trainees scoring 48.0% (SD 15.6), PGY2 at 56.2% (SD 14.7), PGY3 at 59.8% (SD 16.7), PGY4 at 67.2% (12.3), and PGY5 at 70.1% (SD 12.5), whereas GPT-4 scored 88.7%. In 2022, a similar pattern, with PGY1 scoring 46.3% (SD 14.7), PGY2 at 51.8% (SD 14.7), PGY3 at 58.2% (SD 14.3), PGY4 at 66.2% (SD 15.3), and PGY5 at 64.3% (SD 8.5), while GPT-4 scored 82.0%.</p><p><strong>Conclusions: </strong>GPT-4 has shown impressive capabilities, surpassing the performance of medical trainees at different postgraduate levels in the clinical written exam. These findings highlight the potential of artificial intelligence to serve as a valuable support tool in medical practice. However, it should be used with caution and must not substitute for established, evidence-based medical resources.</p>","PeriodicalId":93937,"journal":{"name":"CJEM","volume":" ","pages":""},"PeriodicalIF":2.0000,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"CJEM","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s43678-025-00951-0","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Background: Recent advancements in artificial intelligence have shown promise in enhancing diagnostic precision across healthcare. In emergency departments, artificial intelligence has demonstrated potential for improving triage, guiding the choice of radiologic imaging, and crafting individualized medical notes and discharge summaries, including tailored care plans. Advances in generative artificial intelligence have led to the development of sophisticated models such as OpenAI's GPT-4. This study assessed the ability of generative artificial intelligence to support diagnosis and management in emergency medicine. Specifically, we compared GPT-4's performance with that of emergency medicine trainees in Canada, as measured by the Canadian In-Training Examination.
Methods: We compared the performance of emergency medicine residents to that of GPT-4 on the Canadian In-Training Examinations for 2021 and 2022. Each question was entered into a fresh GPT-4 chat, and the first response was recorded without any additional prompting. GPT-4's responses were then assessed using the same marking grid employed for evaluating medical trainees. We then compared GPT-4's scores to the average scores of residents at each post-graduate year (PGY) level across all FRCPC training programs. Ethical approval was obtained, after which the Canadian In-Training Examination committee provided the exam questions and anonymized national results.
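For readers wishing to reproduce a comparable protocol programmatically, below is a minimal sketch using the OpenAI Python SDK. The study itself entered questions into fresh GPT-4 chat sessions; the model name, helper function, and placeholder question list here are illustrative assumptions, not the authors' actual tooling.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def ask_gpt4(question: str) -> str:
    """Send one exam question in a brand-new conversation and return the
    first response, mirroring the fresh-chat-per-question protocol."""
    response = client.chat.completions.create(
        model="gpt-4",  # illustrative; the study used the GPT-4 chat interface
        messages=[{"role": "user", "content": question}],  # no system prompt, no prompt engineering
    )
    return response.choices[0].message.content

# Collect first responses for later scoring against the trainee marking grid.
exam_questions = ["<question 1 text>", "<question 2 text>"]  # placeholders
first_responses = [ask_gpt4(q) for q in exam_questions]
```

Starting a new conversation per question prevents earlier questions or answers from leaking into later responses, which matches the paper's stated design of recording the first unprompted reply.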
Results: The participants included 389 residents in the 2021 exam and 333 residents in the 2022 exam. In 2021, mean trainee scores increased progressively across the levels, with PGY1 trainees scoring 48.0% (SD 15.6), PGY2 at 56.2% (SD 14.7), PGY3 at 59.8% (SD 16.7), PGY4 at 67.2% (SD 12.3), and PGY5 at 70.1% (SD 12.5), whereas GPT-4 scored 88.7%. In 2022, a similar pattern emerged, with PGY1 scoring 46.3% (SD 14.7), PGY2 at 51.8% (SD 14.7), PGY3 at 58.2% (SD 14.3), PGY4 at 66.2% (SD 15.3), and PGY5 at 64.3% (SD 8.5), while GPT-4 scored 82.0%.
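To put the gap in context, a quick back-of-the-envelope calculation (illustrative only, not an analysis from the paper) expresses GPT-4's score in units of the senior-resident standard deviation, using only the means and SDs reported above:

```python
# GPT-4's distance from the PGY5 cohort mean, in within-cohort SDs.
# All numbers are taken directly from the Results section.
results = {
    2021: {"gpt4": 88.7, "pgy5_mean": 70.1, "pgy5_sd": 12.5},
    2022: {"gpt4": 82.0, "pgy5_mean": 64.3, "pgy5_sd": 8.5},
}

for year, r in results.items():
    z = (r["gpt4"] - r["pgy5_mean"]) / r["pgy5_sd"]
    print(f"{year}: GPT-4 scored {z:.1f} SD above the PGY5 mean")

# Output: 1.5 SD above in 2021, 2.1 SD above in 2022.
```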
Conclusions: GPT-4 demonstrated impressive capabilities, surpassing the performance of medical trainees at every postgraduate level on this clinical written examination. These findings highlight the potential of artificial intelligence to serve as a valuable support tool in medical practice. However, it should be used with caution and must not substitute for established, evidence-based medical resources.