ChatGPT-4o is Not a Reliable Study Source for Orthopaedic Surgery Residents.
Neil Jain, Caleb Gottlich, John Fisher, Travis Winston, Kristofer Matullo, Dustin Greenhill
JBJS Open Access, 10(3), 2025-09-11. DOI: 10.2106/JBJS.OA.25.00112
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12417002/pdf/
Abstract
Background: The use of artificial intelligence platforms by medical residents as an educational resource is increasing. Within orthopaedic surgery, older Chat Generative Pre-trained Transformer (ChatGPT) models performed worse than resident physicians on practice examinations and rarely answered questions with images correctly. The newer ChatGPT-4o was designed to address these deficiencies but has not been evaluated. This study analyzed (1) ChatGPT-4o's ability to correctly answer Orthopaedic In-Training Examination (OITE) questions and (2) the educational quality of the answer explanations that it presents to orthopaedic surgery trainees.
Methods: The 2020 to 2022 OITEs were uploaded into ChatGPT-4o. Annual score reports were used to compare the chatbot's raw score with that of ACGME-accredited orthopaedic residents. ChatGPT-4o's answer explanations were then compared with those provided by the American Academy of Orthopaedic Surgeons (AAOS) and categorized based on (1) the chatbot's answer (correct/incorrect) and (2) the chatbot's answer explanation when compared with the explanation provided by AAOS subject-matter experts (classified as consistent, disparate, or nonexistent). Overall ChatGPT-4o response quality was then simplified into 3 groups. An "ideal" response combined a correct answer with a consistent explanation. "Inadequate" responses provided a correct answer but no explanation. "Unacceptable" responses provided an incorrect answer or disparate explanation.
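As an illustration of this workflow, the sketch below approximates it programmatically with the OpenAI Python SDK and the gpt-4o model. This is an assumption for illustration only: the study queried the ChatGPT-4o web interface, and the question data, prompt wording, and expert grading shown here are hypothetical placeholders rather than the authors' actual pipeline.

```python
# Minimal sketch, assuming the OpenAI Python SDK (openai>=1.0) and an API key
# in the OPENAI_API_KEY environment variable. The study itself used the
# ChatGPT-4o web interface; question content and the reviewer's comparison
# against the AAOS explanation are hypothetical inputs here.
from openai import OpenAI

client = OpenAI()

def ask_oite_question(stem: str, options: dict[str, str]) -> str:
    """Send one OITE-style multiple-choice question and return the model's reply."""
    option_text = "\n".join(f"{letter}. {text}" for letter, text in options.items())
    prompt = (
        "Answer the following orthopaedic in-training examination question. "
        "State the single best answer choice and explain your reasoning.\n\n"
        f"{stem}\n{option_text}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def categorize(answer_correct: bool, explanation_rating: str) -> str:
    """Collapse a graded response into the study's three quality groups.

    explanation_rating is the reviewer's judgment of the chatbot's explanation
    versus the AAOS explanation: 'consistent', 'disparate', or 'nonexistent'.
    """
    if answer_correct and explanation_rating == "consistent":
        return "ideal"          # correct answer with a consistent explanation
    if answer_correct and explanation_rating == "nonexistent":
        return "inadequate"     # correct answer but no explanation
    return "unacceptable"       # incorrect answer or disparate explanation
```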
Results: ChatGPT-4o scored 68.8%, 63.4%, and 70.1% on the 2020, 2021, and 2022 OITEs, respectively. These raw scores corresponded to the performance of ACGME-accredited postgraduate year-5 (PGY-5), PGY-2 to PGY-3, and PGY-4 resident physicians, respectively. Pediatrics and Spine were the only subspecialties in which ChatGPT-4o consistently performed better than a junior resident (≥PGY-3). The quality of responses provided by ChatGPT-4o was ideal, inadequate, or unacceptable in 58.7%, 6.9%, and 34.4% of questions, respectively. ChatGPT-4o scored significantly lower on media-related questions when compared with nonmedia questions (60.0% versus 73.1%, p < 0.001).
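For context, the media versus nonmedia comparison above is a two-proportion comparison. The sketch below reproduces that kind of analysis with a chi-square test of independence in SciPy; the counts are hypothetical placeholders chosen only to resemble the reported accuracies, and the study's actual item counts and statistical method may differ.

```python
# Sketch of a media vs. nonmedia accuracy comparison using a chi-square test
# of independence (scipy.stats.chi2_contingency). Counts are hypothetical
# placeholders, not the study's data.
from scipy.stats import chi2_contingency

media_correct, media_total = 90, 150          # hypothetical: ~60.0% correct
nonmedia_correct, nonmedia_total = 329, 450   # hypothetical: ~73.1% correct

table = [
    [media_correct, media_total - media_correct],
    [nonmedia_correct, nonmedia_total - nonmedia_correct],
]
chi2, p_value, dof, expected = chi2_contingency(table)

print(f"media accuracy:    {media_correct / media_total:.1%}")
print(f"nonmedia accuracy: {nonmedia_correct / nonmedia_total:.1%}")
print(f"chi-square p-value: {p_value:.4g}")
```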
Conclusions: ChatGPT-4o performed inconsistently on the OITE. Moreover, the responses it provided to trainees were not always ideal. Its limited performance on media-based orthopaedic surgery questions also persisted. The use of ChatGPT by resident physicians while studying orthopaedic surgery concepts remains unvalidated.
Level of evidence: Level IV. See Instructions for Authors for a complete description of levels of evidence.