GPT-4 as a Board-Certified Surgeon: A Pilot Study.

Impact Factor: 1.9 · Q2 (Education, Scientific Disciplines)
Medical Science Educator · Pub Date: 2025-03-13 · eCollection Date: 2025-06-01 · DOI: 10.1007/s40670-025-02352-5
Joshua A Roshal, Caitlin Silvestri, Tejas Sathe, Courtney Townsend, V Suzanne Klimberg, Alexander Perez
{"title":"GPT-4 as a Board-Certified Surgeon: A Pilot Study.","authors":"Joshua A Roshal, Caitlin Silvestri, Tejas Sathe, Courtney Townsend, V Suzanne Klimberg, Alexander Perez","doi":"10.1007/s40670-025-02352-5","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Large language models (LLMs), such as GPT-4 (OpenAI; San Francisco, CA), are promising tools for surgical education. However, skepticism surrounding their accuracy and reliability remains a significant barrier to their widespread adoption. Although GPT-4 has demonstrated a remarkable ability to pass multiple-choice tests, its general surgery knowledge and clinical judgment in complex oral-based examinations are less clear. This study aims to evaluate GPT-4's general surgery knowledge on mock written and oral board-style examinations to drive improvements that will enable the tool to revolutionize surgical education and practice.</p><p><strong>Methods: </strong>We tested GPT-4's ability to answer 250 random multiple-choice questions (MCQs) from the Surgical Council on Resident Education (SCORE) question bank and navigate 4 oral board scenarios derived from the Entrustable Professional Activities (EPA) topic list. Two former oral board examiners assessed the responses independently for accuracy.</p><p><strong>Results: </strong>On MCQs, GPT-4 answered 197 out of 250 (78.8%) correctly, corresponding to a 92% probability of passing the American Board of Surgery Qualifying Examination (ABS QE). On oral board scenarios, GPT-4 committed critical failures in 3 out of 4 (75%) clinical cases. Common reasons for failure were incorrect timing of intervention and incorrect suggested operation.</p><p><strong>Conclusions: </strong>While GPT-4's high performance on MCQs mirrored prior studies, the model struggled to generate accurate long-form content in our mock oral board examination. Future efforts should use specialized datasets and advanced reinforcement learning to improve LLM performance in complex, high-stakes clinical decision-making.</p>","PeriodicalId":37113,"journal":{"name":"Medical Science Educator","volume":"35 3","pages":"1557-1566"},"PeriodicalIF":1.9000,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12228599/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medical Science Educator","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s40670-025-02352-5","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/6/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
Citations: 0

Abstract

Purpose: Large language models (LLMs), such as GPT-4 (OpenAI; San Francisco, CA), are promising tools for surgical education. However, skepticism surrounding their accuracy and reliability remains a significant barrier to their widespread adoption. Although GPT-4 has demonstrated a remarkable ability to pass multiple-choice tests, its general surgery knowledge and clinical judgment in complex oral examinations are less clear. This study aims to evaluate GPT-4's general surgery knowledge on mock written and oral board-style examinations, with the goal of identifying improvements that could make the tool genuinely useful in surgical education and practice.

Methods: We tested GPT-4's ability to answer 250 random multiple-choice questions (MCQs) from the Surgical Council on Resident Education (SCORE) question bank and navigate 4 oral board scenarios derived from the Entrustable Professional Activities (EPA) topic list. Two former oral board examiners assessed the responses independently for accuracy.
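
The paper does not describe how GPT-4 was queried or how the multiple-choice answers were graded, so the following is only a minimal sketch of how such an MCQ evaluation could be scripted, assuming programmatic access through the OpenAI Python SDK; the question stem, answer choices, key, and scoring logic are hypothetical placeholders, not items from the SCORE bank.

```python
# Minimal MCQ-grading sketch (assumption: OpenAI Python SDK; the study does not
# state whether GPT-4 was queried via the API or the ChatGPT interface).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical stand-in for one question; real SCORE items are proprietary.
question = {
    "stem": "A patient presents with ... What is the next best step in management?",
    "choices": {"A": "Observation", "B": "CT angiography",
                "C": "Exploratory laparotomy", "D": "Broad-spectrum antibiotics"},
    "key": "B",
}

prompt = (
    question["stem"] + "\n"
    + "\n".join(f"{letter}. {text}" for letter, text in question["choices"].items())
    + "\nAnswer with the single letter of the best choice."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,  # keep answers reproducible for scoring
)

model_answer = response.choices[0].message.content.strip()[:1].upper()
print("correct" if model_answer == question["key"]
      else f"incorrect (model chose {model_answer})")
```

Repeating this over 250 items and tallying correct responses would cover only the MCQ portion of the protocol; the oral board scenarios were graded qualitatively by two former examiners and are not captured by this sketch.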

Results: On MCQs, GPT-4 answered 197 out of 250 (78.8%) correctly, corresponding to a 92% probability of passing the American Board of Surgery Qualifying Examination (ABS QE). On oral board scenarios, GPT-4 committed critical failures in 3 out of 4 (75%) clinical cases. Common reasons for failure were incorrect timing of intervention and incorrect suggested operation.

Conclusions: While GPT-4's high performance on MCQs mirrored prior studies, the model struggled to generate accurate long-form content in our mock oral board examination. Future efforts should use specialized datasets and advanced reinforcement learning to improve LLM performance in complex, high-stakes clinical decision-making.

Source Journal: Medical Science Educator (Social Sciences: Education)
CiteScore: 2.90 · Self-citation rate: 11.80% · Articles published: 202
Journal description: Medical Science Educator is the successor of the journal JIAMSE. It is the peer-reviewed publication of the International Association of Medical Science Educators (IAMSE). The Journal offers all who teach in healthcare the most current information to succeed in their task by publishing scholarly activities, opinions, and resources in medical science education. Published articles focus on teaching the sciences fundamental to modern medicine and health, and include basic science education, clinical teaching, and the use of modern education technologies. The Journal provides the readership a better understanding of teaching and learning techniques in order to advance medical science education.