Critical Analysis of ChatGPT 4 Omni in USMLE Disciplines, Clinical Clerkships, and Clinical Skills.

Impact factor: 3.2. JCR quartile: Q1 (Education, Scientific Disciplines)

JMIR Medical Education. Pub Date: 2024-09-14. DOI: 10.2196/63430

Brenton T Bicknell, Danner Butler, Sydney Whalen, James Ricks, Cory J Dixon, Abigail B Clark, Olivia Spaedy, Adam Skelton, Neel Edupuganti, Lance Dzubinski, Hudson Tate, Garrett Dyess, Brenessa Lindeman, Lisa Soleymani Lehmann


Citation count: 0

Abstract

Background: Recent studies, including those by the National Board of Medical Examiners (NBME), have highlighted the ability of large language models (LLMs) such as ChatGPT to pass the United States Medical Licensing Examination (USMLE). However, these models' performance in specific medical content areas has not been analyzed in detail, which limits assessment of their potential utility for medical education.

Objective: To assess and compare the accuracy of successive ChatGPT versions (GPT-3.5, GPT-4, and GPT-4 Omni) in USMLE disciplines, clinical clerkships, and the clinical skills of diagnostics and management.

Methods: This study used 750 clinical vignette-based multiple-choice questions (MCQs) to characterize the performance of successive ChatGPT versions (ChatGPT 3.5 [GPT-3.5], ChatGPT 4 [GPT-4], and ChatGPT 4 Omni [GPT-4o]) across USMLE disciplines, clinical clerkships, and clinical skills (diagnostics and management). Accuracy was assessed using a standardized protocol, and statistical analyses were conducted to compare the models' performance.

Results: GPT-4o achieved the highest accuracy across the 750 MCQs at 90.4%, outperforming GPT-4 and GPT-3.5, which scored 81.1% and 60.0%, respectively. GPT-4o performed best in social sciences (95.5%), behavioral and neuroscience (94.2%), and pharmacology (93.2%). In clinical skills, GPT-4o's diagnostic accuracy was 92.7% and its management accuracy was 88.8%, both significantly higher than those of its predecessors. Notably, both GPT-4o and GPT-4 significantly outperformed the medical student average accuracy of 59.3% (95% CI: 58.3-60.3).
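To make the reported accuracies easier to interpret, the following Python sketch converts each overall percentage back into an approximate count of correct answers out of 750 and attaches a 95% Wilson score interval. This is an illustration under stated assumptions, not the authors' analysis: the abstract does not name the statistical tests used, the counts are reconstructed from rounded percentages, and the function names are hypothetical.

```python
import math

# Overall accuracies as reported in the abstract (750 MCQs per model).
N_QUESTIONS = 750
REPORTED_ACCURACY = {"GPT-4o": 0.904, "GPT-4": 0.811, "GPT-3.5": 0.600}


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


def summarize() -> dict[str, tuple[int, float, float]]:
    """Return (approximate correct count, CI low, CI high) per model.

    ASSUMPTION: counts are recovered by rounding percentage * 750, so they
    may differ by one question from the true counts in the paper.
    """
    rows = {}
    for model, accuracy in REPORTED_ACCURACY.items():
        correct = round(accuracy * N_QUESTIONS)
        low, high = wilson_interval(correct, N_QUESTIONS)
        rows[model] = (correct, low, high)
    return rows


if __name__ == "__main__":
    for model, (correct, low, high) in summarize().items():
        print(f"{model}: {correct}/{N_QUESTIONS} correct, "
              f"95% CI {low:.1%} to {high:.1%}")
```

Under these assumptions the intervals for the three models do not overlap, which is consistent with the abstract's statement that the differences are significant. Because all three models answered the same questions, a paired test on per-question results would be the more appropriate comparison, but the per-question data are not available in the abstract.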

Conclusions: ChatGPT 4 Omni's performance in USMLE preclinical content areas and clinical skills indicates substantial improvement over its predecessors, suggesting significant potential for this technology as an educational aid for medical students. These findings underscore the need to consider carefully how LLMs are integrated into medical education, the importance of structured curricula to guide their appropriate use, and the need for ongoing critical analysis to ensure their reliability and effectiveness.



Source journal