ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis.

IF 3.2 Q1 EDUCATION, SCIENTIFIC DISCIPLINES

JMIR Medical Education Pub Date : 2024-11-06 DOI:10.2196/63430

Brenton T Bicknell, Danner Butler, Sydney Whalen, James Ricks, Cory J Dixon, Abigail B Clark, Olivia Spaedy, Adam Skelton, Neel Edupuganti, Lance Dzubinski, Hudson Tate, Garrett Dyess, Brenessa Lindeman, Lisa Soleymani Lehmann

{"title":"ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis.","authors":"Brenton T Bicknell, Danner Butler, Sydney Whalen, James Ricks, Cory J Dixon, Abigail B Clark, Olivia Spaedy, Adam Skelton, Neel Edupuganti, Lance Dzubinski, Hudson Tate, Garrett Dyess, Brenessa Lindeman, Lisa Soleymani Lehmann","doi":"10.2196/63430","DOIUrl":null,"url":null,"abstract":"Background: Recent studies, including those by the National Board of Medical Examiners, have highlighted the remarkable capabilities of recent large language models (LLMs) such as ChatGPT in passing the United States Medical Licensing Examination (USMLE). However, there is a gap in detailed analysis of LLM performance in specific medical content areas, thus limiting an assessment of their potential utility in medical education.Objective: This study aimed to assess and compare the accuracy of successive ChatGPT versions (GPT-3.5, GPT-4, and GPT-4 Omni) in USMLE disciplines, clinical clerkships, and the clinical skills of diagnostics and management.Methods: This study used 750 clinical vignette-based multiple-choice questions to characterize the performance of successive ChatGPT versions (ChatGPT 3.5 [GPT-3.5], ChatGPT 4 [GPT-4], and ChatGPT 4 Omni [GPT-4o]) across USMLE disciplines, clinical clerkships, and in clinical skills (diagnostics and management). Accuracy was assessed using a standardized protocol, with statistical analyses conducted to compare the models' performances.Results: GPT-4o achieved the highest accuracy across 750 multiple-choice questions at 90.4%, outperforming GPT-4 and GPT-3.5, which scored 81.1% and 60.0%, respectively. GPT-4o's highest performances were in social sciences (95.5%), behavioral and neuroscience (94.2%), and pharmacology (93.2%). In clinical skills, GPT-4o's diagnostic accuracy was 92.7% and management accuracy was 88.8%, significantly higher than its predecessors. Notably, both GPT-4o and GPT-4 significantly outperformed the medical student average accuracy of 59.3% (95% CI 58.3-60.3).Conclusions: GPT-4o's performance in USMLE disciplines, clinical clerkships, and clinical skills indicates substantial improvements over its predecessors, suggesting significant potential for the use of this technology as an educational aid for medical students. These findings underscore the need for careful consideration when integrating LLMs into medical education, emphasizing the importance of structured curricula to guide their appropriate use and the need for ongoing critical analyses to ensure their reliability and effectiveness.","PeriodicalId":36236,"journal":{"name":"JMIR Medical Education","volume":"10 ","pages":"e63430"},"PeriodicalIF":3.2000,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11611793/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR Medical Education","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/63430","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}

引用次数: 0

Abstract

Background: Recent studies, including those by the National Board of Medical Examiners, have highlighted the remarkable capabilities of recent large language models (LLMs) such as ChatGPT in passing the United States Medical Licensing Examination (USMLE). However, there is a gap in detailed analysis of LLM performance in specific medical content areas, thus limiting an assessment of their potential utility in medical education.

Objective: This study aimed to assess and compare the accuracy of successive ChatGPT versions (GPT-3.5, GPT-4, and GPT-4 Omni) in USMLE disciplines, clinical clerkships, and the clinical skills of diagnostics and management.

Methods: This study used 750 clinical vignette-based multiple-choice questions to characterize the performance of successive ChatGPT versions (ChatGPT 3.5 [GPT-3.5], ChatGPT 4 [GPT-4], and ChatGPT 4 Omni [GPT-4o]) across USMLE disciplines, clinical clerkships, and in clinical skills (diagnostics and management). Accuracy was assessed using a standardized protocol, with statistical analyses conducted to compare the models' performances.

Results: GPT-4o achieved the highest accuracy across 750 multiple-choice questions at 90.4%, outperforming GPT-4 and GPT-3.5, which scored 81.1% and 60.0%, respectively. GPT-4o's highest performances were in social sciences (95.5%), behavioral and neuroscience (94.2%), and pharmacology (93.2%). In clinical skills, GPT-4o's diagnostic accuracy was 92.7% and management accuracy was 88.8%, significantly higher than its predecessors. Notably, both GPT-4o and GPT-4 significantly outperformed the medical student average accuracy of 59.3% (95% CI 58.3-60.3).

Conclusions: GPT-4o's performance in USMLE disciplines, clinical clerkships, and clinical skills indicates substantial improvements over its predecessors, suggesting significant potential for the use of this technology as an educational aid for medical students. These findings underscore the need for careful consideration when integrating LLMs into medical education, emphasizing the importance of structured curricula to guide their appropriate use and the need for ongoing critical analyses to ensure their reliability and effectiveness.

查看原文本刊更多论文

ChatGPT-4 Omni 在 USMLE 学科和临床技能中的表现：比较分析。

背景：最近的研究，包括美国国家医学考试委员会（National Board of Medical Examiners）的研究，都强调了最近的大型语言模型（LLM），如 ChatGPT，在通过美国医学执照考试（USMLE）方面的卓越能力。然而，在详细分析 LLM 在特定医学内容领域的表现方面还存在空白，从而限制了对其在医学教育中潜在用途的评估：本研究旨在评估和比较历代 ChatGPT 版本（GPT-3.5、GPT-4 和 GPT-4 Omni）在 USMLE 学科、临床实习以及诊断和管理临床技能方面的准确性：本研究使用了 750 道基于临床小故事的选择题，以描述历代 ChatGPT 版本（ChatGPT 3.5 [GPT-3.5]、ChatGPT 4 [GPT-4]和 ChatGPT 4 Omni [GPT-4o]）在 USMLE 学科、临床实习和临床技能（诊断和管理）方面的表现。采用标准化方案评估准确性，并进行统计分析以比较模型的性能：结果：在750道选择题中，GPT-4o的准确率最高，达到90.4%，超过了分别为81.1%和60.0%的GPT-4和GPT-3.5。GPT-4o 在社会科学（95.5%）、行为与神经科学（94.2%）和药理学（93.2%）方面表现最佳。在临床技能方面，GPT-4o 的诊断准确率为 92.7%，管理准确率为 88.8%，明显高于其前身。值得注意的是，GPT-4o和GPT-4的准确率均明显高于医学生59.3%（95% CI 58.3-60.3）的平均准确率：结论：GPT-4o 在 USMLE 学科、临床实习和临床技能方面的表现比其前代产品有了大幅提高，这表明该技术作为医学生教育辅助工具的巨大潜力。这些发现强调了在将 LLMs 纳入医学教育时需要慎重考虑的问题，强调了结构化课程对指导其合理使用的重要性，以及持续进行关键分析以确保其可靠性和有效性的必要性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊