Joshua A Roshal, Caitlin Silvestri, Tejas Sathe, Courtney Townsend, V Suzanne Klimberg, Alexander Perez
{"title":"GPT-4作为委员会认证的外科医生:一项试点研究。","authors":"Joshua A Roshal, Caitlin Silvestri, Tejas Sathe, Courtney Townsend, V Suzanne Klimberg, Alexander Perez","doi":"10.1007/s40670-025-02352-5","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Large language models (LLMs), such as GPT-4 (OpenAI; San Francisco, CA), are promising tools for surgical education. However, skepticism surrounding their accuracy and reliability remains a significant barrier to their widespread adoption. Although GPT-4 has demonstrated a remarkable ability to pass multiple-choice tests, its general surgery knowledge and clinical judgment in complex oral-based examinations are less clear. This study aims to evaluate GPT-4's general surgery knowledge on mock written and oral board-style examinations to drive improvements that will enable the tool to revolutionize surgical education and practice.</p><p><strong>Methods: </strong>We tested GPT-4's ability to answer 250 random multiple-choice questions (MCQs) from the Surgical Council on Resident Education (SCORE) question bank and navigate 4 oral board scenarios derived from the Entrustable Professional Activities (EPA) topic list. Two former oral board examiners assessed the responses independently for accuracy.</p><p><strong>Results: </strong>On MCQs, GPT-4 answered 197 out of 250 (78.8%) correctly, corresponding to a 92% probability of passing the American Board of Surgery Qualifying Examination (ABS QE). On oral board scenarios, GPT-4 committed critical failures in 3 out of 4 (75%) clinical cases. Common reasons for failure were incorrect timing of intervention and incorrect suggested operation.</p><p><strong>Conclusions: </strong>While GPT-4's high performance on MCQs mirrored prior studies, the model struggled to generate accurate long-form content in our mock oral board examination. Future efforts should use specialized datasets and advanced reinforcement learning to improve LLM performance in complex, high-stakes clinical decision-making.</p>","PeriodicalId":37113,"journal":{"name":"Medical Science Educator","volume":"35 3","pages":"1557-1566"},"PeriodicalIF":1.9000,"publicationDate":"2025-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12228599/pdf/","citationCount":"0","resultStr":"{\"title\":\"GPT-4 as a Board-Certified Surgeon: A Pilot Study.\",\"authors\":\"Joshua A Roshal, Caitlin Silvestri, Tejas Sathe, Courtney Townsend, V Suzanne Klimberg, Alexander Perez\",\"doi\":\"10.1007/s40670-025-02352-5\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>Large language models (LLMs), such as GPT-4 (OpenAI; San Francisco, CA), are promising tools for surgical education. However, skepticism surrounding their accuracy and reliability remains a significant barrier to their widespread adoption. Although GPT-4 has demonstrated a remarkable ability to pass multiple-choice tests, its general surgery knowledge and clinical judgment in complex oral-based examinations are less clear. 
This study aims to evaluate GPT-4's general surgery knowledge on mock written and oral board-style examinations to drive improvements that will enable the tool to revolutionize surgical education and practice.</p><p><strong>Methods: </strong>We tested GPT-4's ability to answer 250 random multiple-choice questions (MCQs) from the Surgical Council on Resident Education (SCORE) question bank and navigate 4 oral board scenarios derived from the Entrustable Professional Activities (EPA) topic list. Two former oral board examiners assessed the responses independently for accuracy.</p><p><strong>Results: </strong>On MCQs, GPT-4 answered 197 out of 250 (78.8%) correctly, corresponding to a 92% probability of passing the American Board of Surgery Qualifying Examination (ABS QE). On oral board scenarios, GPT-4 committed critical failures in 3 out of 4 (75%) clinical cases. Common reasons for failure were incorrect timing of intervention and incorrect suggested operation.</p><p><strong>Conclusions: </strong>While GPT-4's high performance on MCQs mirrored prior studies, the model struggled to generate accurate long-form content in our mock oral board examination. Future efforts should use specialized datasets and advanced reinforcement learning to improve LLM performance in complex, high-stakes clinical decision-making.</p>\",\"PeriodicalId\":37113,\"journal\":{\"name\":\"Medical Science Educator\",\"volume\":\"35 3\",\"pages\":\"1557-1566\"},\"PeriodicalIF\":1.9000,\"publicationDate\":\"2025-03-13\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12228599/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Medical Science Educator\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1007/s40670-025-02352-5\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/6/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q2\",\"JCRName\":\"EDUCATION, SCIENTIFIC DISCIPLINES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Medical Science Educator","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s40670-025-02352-5","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/6/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"EDUCATION, SCIENTIFIC DISCIPLINES","Score":null,"Total":0}
GPT-4 as a Board-Certified Surgeon: A Pilot Study.
Purpose: Large language models (LLMs), such as GPT-4 (OpenAI; San Francisco, CA), are promising tools for surgical education. However, skepticism surrounding their accuracy and reliability remains a significant barrier to their widespread adoption. Although GPT-4 has demonstrated a remarkable ability to pass multiple-choice tests, its general surgery knowledge and clinical judgment in complex oral-based examinations are less clear. This study aims to evaluate GPT-4's general surgery knowledge on mock written and oral board-style examinations to drive improvements that will enable the tool to revolutionize surgical education and practice.
Methods: We tested GPT-4's ability to answer 250 random multiple-choice questions (MCQs) from the Surgical Council on Resident Education (SCORE) question bank and navigate 4 oral board scenarios derived from the Entrustable Professional Activities (EPA) topic list. Two former oral board examiners assessed the responses independently for accuracy.
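The paper does not publish its prompting pipeline, but the MCQ portion of the Methods can be pictured as a simple query-and-grade loop against the OpenAI chat API. The sketch below is illustrative only: the SCORE question bank is proprietary, so the `questions` structure, the `ask_mcq`/`score_bank` helpers, and the prompt wording are assumptions rather than the authors' actual code.

```python
# Hypothetical sketch of an MCQ evaluation loop like the one described in the Methods.
# `questions` stands in for a locally prepared list of dicts with "stem",
# "options" (letter -> answer text), and "answer" (correct letter) keys;
# the real SCORE items are proprietary and not reproduced here.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_mcq(stem: str, options: dict[str, str]) -> str:
    """Prompt GPT-4 with one multiple-choice question and return its letter choice."""
    option_text = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer with a single letter only."},
            {"role": "user", "content": f"{stem}\n\n{option_text}"},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()[0]


def score_bank(questions: list[dict]) -> float:
    """Return the fraction of questions answered correctly."""
    correct = sum(ask_mcq(q["stem"], q["options"]) == q["answer"] for q in questions)
    return correct / len(questions)

# With results like the paper's, score_bank would return 197/250 = 0.788 (78.8%).
```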
Results: On MCQs, GPT-4 answered 197 out of 250 (78.8%) correctly, corresponding to a 92% probability of passing the American Board of Surgery Qualifying Examination (ABS QE). On oral board scenarios, GPT-4 committed critical failures in 3 out of 4 (75%) clinical cases. Common reasons for failure were incorrect timing of intervention and incorrect suggested operation.
Conclusions: While GPT-4's high performance on MCQs mirrored prior studies, the model struggled to generate accurate long-form content in our mock oral board examination. Future efforts should use specialized datasets and advanced reinforcement learning to improve LLM performance in complex, high-stakes clinical decision-making.
Journal Introduction:
Medical Science Educator is the successor to the journal JIAMSE and the peer-reviewed publication of the International Association of Medical Science Educators (IAMSE). The Journal offers everyone who teaches in healthcare current information for succeeding in that task by publishing scholarly activities, opinions, and resources in medical science education. Published articles focus on teaching the sciences fundamental to modern medicine and health, including basic science education, clinical teaching, and the use of modern education technologies. The Journal gives its readership a better understanding of teaching and learning techniques in order to advance medical science education.