ChatGPT-4o is Not a Reliable Study Source for Orthopaedic Surgery Residents.

IF 3.8 Q2 ORTHOPEDICS
JBJS Open Access | Pub Date: 2025-09-11 | eCollection Date: 2025-07-01 | DOI: 10.2106/JBJS.OA.25.00112
Neil Jain, Caleb Gottlich, John Fisher, Travis Winston, Kristofer Matullo, Dustin Greenhill
{"title":"chatgpt - 40不是骨科住院医师可靠的研究来源。","authors":"Neil Jain, Caleb Gottlich, John Fisher, Travis Winston, Kristofer Matullo, Dustin Greenhill","doi":"10.2106/JBJS.OA.25.00112","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The use of artificial intelligence platforms by medical residents as an educational resource is increasing. Within orthopaedic surgery, older Chat Generative Pre-trained Transformer (ChatGPT) models performed worse than resident physicians on practice examinations and rarely answered questions with images correctly. The newer ChatGPT-4o was designed to improve these deficiencies but has not been evaluated. This study analyzed (1) ChatGPT-4o's ability to correctly answer Orthopaedic In-Training Examination (OITE) questions and (2) the educational quality of the answer explanations that it presents to our orthopaedic surgery trainees.</p><p><strong>Methods: </strong>The 2020 to 2022 OITEs were uploaded into ChatGPT-4o. Annual score reports were used to compare the chatbot's raw score with that of ACGME-accredited orthopaedic residents. ChatGPT-4o's answer explanations were then compared with those provided by the American Academy of Orthopaedic Surgeons (AAOS) and categorized based on (1) the chatbot's answer (correct/incorrect) and (2) the chatbot's answer explanation when compared with the explanation provided by AAOS subject-matter experts (classified as consistent, disparate, or nonexistent). Overall ChatGPT-4o response quality was then simplified into 3 groups. An \"ideal\" response combined a correct answer with a consistent explanation. \"Inadequate\" responses provided a correct answer but no explanation. \"Unacceptable\" responses provided an incorrect answer or disparate explanation.</p><p><strong>Results: </strong>ChatGPT-4o scored 68.8%, 63.4%, and 70.1% on the 2020, 2021, and 2022 OITEs, respectively. These raw scores corresponded with ACGME-accredited postgraduate year-5 (PGY-5), PGY2-3, and PGY-4 resident physicians. Pediatrics and Spine were the only subspecialties whereby ChatGPT-4o consistently performed better than a junior resident (≥PGY-3). The quality of responses provided by ChatGPT-4o was ideal, inadequate, or unacceptable in 58.7%, 6.9%, and 34.4% of questions, respectively. ChatGPT-4o scored significantly lower on media-related questions when compared with nonmedia questions (60.0% versus 73.1%, p < 0.001).</p><p><strong>Conclusions: </strong>ChatGPT-4o performed inconsistently on the OITE. Moreover, the responses it provided trainees were not always ideal. Its limited performance on media-based orthopaedic surgery questions also persisted. The use of ChatGPT by resident physicians while studying orthopaedic surgery concepts remains unvalidated.</p><p><strong>Level of evidence: </strong>Level IV. 
See Instructions for Authors for a complete description of levels of evidence.</p>","PeriodicalId":36492,"journal":{"name":"JBJS Open Access","volume":"10 3","pages":""},"PeriodicalIF":3.8000,"publicationDate":"2025-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12417002/pdf/","citationCount":"0","resultStr":"{\"title\":\"ChatGPT-4o is Not a Reliable Study Source for Orthopaedic Surgery Residents.\",\"authors\":\"Neil Jain, Caleb Gottlich, John Fisher, Travis Winston, Kristofer Matullo, Dustin Greenhill\",\"doi\":\"10.2106/JBJS.OA.25.00112\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>The use of artificial intelligence platforms by medical residents as an educational resource is increasing. Within orthopaedic surgery, older Chat Generative Pre-trained Transformer (ChatGPT) models performed worse than resident physicians on practice examinations and rarely answered questions with images correctly. The newer ChatGPT-4o was designed to improve these deficiencies but has not been evaluated. This study analyzed (1) ChatGPT-4o's ability to correctly answer Orthopaedic In-Training Examination (OITE) questions and (2) the educational quality of the answer explanations that it presents to our orthopaedic surgery trainees.</p><p><strong>Methods: </strong>The 2020 to 2022 OITEs were uploaded into ChatGPT-4o. Annual score reports were used to compare the chatbot's raw score with that of ACGME-accredited orthopaedic residents. ChatGPT-4o's answer explanations were then compared with those provided by the American Academy of Orthopaedic Surgeons (AAOS) and categorized based on (1) the chatbot's answer (correct/incorrect) and (2) the chatbot's answer explanation when compared with the explanation provided by AAOS subject-matter experts (classified as consistent, disparate, or nonexistent). Overall ChatGPT-4o response quality was then simplified into 3 groups. An \\\"ideal\\\" response combined a correct answer with a consistent explanation. \\\"Inadequate\\\" responses provided a correct answer but no explanation. \\\"Unacceptable\\\" responses provided an incorrect answer or disparate explanation.</p><p><strong>Results: </strong>ChatGPT-4o scored 68.8%, 63.4%, and 70.1% on the 2020, 2021, and 2022 OITEs, respectively. These raw scores corresponded with ACGME-accredited postgraduate year-5 (PGY-5), PGY2-3, and PGY-4 resident physicians. Pediatrics and Spine were the only subspecialties whereby ChatGPT-4o consistently performed better than a junior resident (≥PGY-3). The quality of responses provided by ChatGPT-4o was ideal, inadequate, or unacceptable in 58.7%, 6.9%, and 34.4% of questions, respectively. ChatGPT-4o scored significantly lower on media-related questions when compared with nonmedia questions (60.0% versus 73.1%, p < 0.001).</p><p><strong>Conclusions: </strong>ChatGPT-4o performed inconsistently on the OITE. Moreover, the responses it provided trainees were not always ideal. Its limited performance on media-based orthopaedic surgery questions also persisted. The use of ChatGPT by resident physicians while studying orthopaedic surgery concepts remains unvalidated.</p><p><strong>Level of evidence: </strong>Level IV. 
See Instructions for Authors for a complete description of levels of evidence.</p>\",\"PeriodicalId\":36492,\"journal\":{\"name\":\"JBJS Open Access\",\"volume\":\"10 3\",\"pages\":\"\"},\"PeriodicalIF\":3.8000,\"publicationDate\":\"2025-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12417002/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"JBJS Open Access\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2106/JBJS.OA.25.00112\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2025/7/1 0:00:00\",\"PubModel\":\"eCollection\",\"JCR\":\"Q2\",\"JCRName\":\"ORTHOPEDICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"JBJS Open Access","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2106/JBJS.OA.25.00112","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/7/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
Citations: 0

Abstract


Background: The use of artificial intelligence platforms by medical residents as an educational resource is increasing. Within orthopaedic surgery, older Chat Generative Pre-trained Transformer (ChatGPT) models performed worse than resident physicians on practice examinations and rarely answered questions with images correctly. The newer ChatGPT-4o was designed to improve these deficiencies but has not been evaluated. This study analyzed (1) ChatGPT-4o's ability to correctly answer Orthopaedic In-Training Examination (OITE) questions and (2) the educational quality of the answer explanations that it presents to our orthopaedic surgery trainees.

Methods: The 2020 to 2022 OITEs were uploaded into ChatGPT-4o. Annual score reports were used to compare the chatbot's raw score with that of ACGME-accredited orthopaedic residents. ChatGPT-4o's answer explanations were then compared with those provided by the American Academy of Orthopaedic Surgeons (AAOS) and categorized based on (1) the chatbot's answer (correct/incorrect) and (2) the chatbot's answer explanation when compared with the explanation provided by AAOS subject-matter experts (classified as consistent, disparate, or nonexistent). Overall ChatGPT-4o response quality was then simplified into 3 groups. An "ideal" response combined a correct answer with a consistent explanation. "Inadequate" responses provided a correct answer but no explanation. "Unacceptable" responses provided an incorrect answer or disparate explanation.
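The abstract does not state whether questions were entered through the ChatGPT web interface or submitted programmatically. The sketch below is a minimal, hypothetical illustration of how the question-submission step could be scripted with the OpenAI Python SDK; the model name, prompt wording, and question format are assumptions, and the subsequent comparison against AAOS answer explanations was performed by human reviewers in the study, not by code.

```python
# Hypothetical sketch of the question-submission step described in the Methods.
# Assumes the OpenAI Python SDK (pip install openai) and an OPENAI_API_KEY in the
# environment; the study itself does not say how questions were uploaded.
from openai import OpenAI

client = OpenAI()

def ask_oite_question(stem: str, choices: dict[str, str]) -> str:
    """Submit one OITE-style question to the model and return its raw response."""
    options = "\n".join(f"{letter}. {text}" for letter, text in choices.items())
    prompt = (
        "Answer the following Orthopaedic In-Training Examination question. "
        "Give the single best answer choice and briefly explain your reasoning.\n\n"
        f"{stem}\n{options}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: API counterpart of ChatGPT-4o
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# In the study, each response was then graded against the AAOS answer key and
# subject-matter-expert rationale (ideal / inadequate / unacceptable); that
# comparison was done by reviewers and is not automated here.
```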

Results: ChatGPT-4o scored 68.8%, 63.4%, and 70.1% on the 2020, 2021, and 2022 OITEs, respectively. These raw scores corresponded with ACGME-accredited postgraduate year-5 (PGY-5), PGY2-3, and PGY-4 resident physicians. Pediatrics and Spine were the only subspecialties whereby ChatGPT-4o consistently performed better than a junior resident (≥PGY-3). The quality of responses provided by ChatGPT-4o was ideal, inadequate, or unacceptable in 58.7%, 6.9%, and 34.4% of questions, respectively. ChatGPT-4o scored significantly lower on media-related questions when compared with nonmedia questions (60.0% versus 73.1%, p < 0.001).
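The media versus non-media comparison (60.0% versus 73.1%, p < 0.001) is a standard two-proportion comparison. The abstract does not report the underlying question counts or name the test used, so the sketch below uses invented placeholder counts purely to show how such a p-value is typically obtained, here with a chi-square test from SciPy.

```python
# Illustrative two-proportion comparison; the counts below are placeholders,
# NOT the study's data (the abstract reports only percentages and p < 0.001).
from scipy.stats import chi2_contingency

media_correct, media_total = 90, 150          # hypothetical: ~60.0% correct
nonmedia_correct, nonmedia_total = 475, 650   # hypothetical: ~73.1% correct

table = [
    [media_correct, media_total - media_correct],
    [nonmedia_correct, nonmedia_total - nonmedia_correct],
]
chi2, p_value, dof, _ = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}")
```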

Conclusions: ChatGPT-4o performed inconsistently on the OITE. Moreover, the responses it provided trainees were not always ideal. Its limited performance on media-based orthopaedic surgery questions also persisted. The use of ChatGPT by resident physicians while studying orthopaedic surgery concepts remains unvalidated.

Level of evidence: Level IV. See Instructions for Authors for a complete description of levels of evidence.

Source journal
JBJS Open Access (Medicine-Surgery)
CiteScore: 5.00
Self-citation rate: 0.00%
Articles published: 77
Review turnaround: 6 weeks