Benjamin Nieves-Lopez, Alexandra R Bechtle, Jennifer Traverse, Christopher Klifto, Bradley S Schoch, Keith T Aziz
{"title":"Evaluating the Evolution of ChatGPT as an Information Resource in Shoulder and Elbow Surgery.","authors":"Benjamin Nieves-Lopez, Alexandra R Bechtle, Jennifer Traverse, Christopher Klifto, Bradley S Schoch, Keith T Aziz","doi":"10.3928/01477447-20250123-03","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>The purpose of this study was to evaluate the performance and evolution of Chat Generative Pre-Trained Transformer (ChatGPT; OpenAI) as a resource for shoulder and elbow surgery information by assessing its accuracy on the American Academy of Orthopaedic Surgeons shoulder-elbow self-assessment questions. We hypothesized that both ChatGPT models would demonstrate proficiency and that there would be significant improvement with progressive iterations.</p><p><strong>Materials and methods: </strong>A total of 200 questions were selected from the 2019 and 2021 American Academy of Orthopaedic Surgeons shoulder-elbow self-assessment questions. ChatGPT 3.5 and 4 were used to evaluate all questions. Questions with non-text data were excluded (114 questions). Remaining questions were input into ChatGPT and categorized as follows: anatomy, arthroplasty, basic science, instability, miscellaneous, nonoperative, and trauma. ChatGPT's performances were quantified and compared across categories with chi-square tests. The continuing medical education credit threshold of 50% was used to determine proficiency. Statistical significance was set at <i>P</i><.05.</p><p><strong>Results: </strong>ChatGPT 3.5 and 4 answered 52.3% and 73.3% of the questions correctly, respectively (<i>P</i>=.003). ChatGPT 3.5 performed significantly better in the instability category (<i>P</i>=.037). ChatGPT 4's performance did not significantly differ across categories (<i>P</i>=.841). ChatGPT 4 performed significantly better than ChatGPT 3.5 in all categories except instability and miscellaneous.</p><p><strong>Conclusion: </strong>ChatGPT 3.5 and 4 exceeded the proficiency threshold. ChatGPT 4 performed better than ChatGPT 3.5, showing an increased capability to correctly answer shoulder and elbow-focused questions. Further refinement of ChatGPT's training may improve its performance and utility as a resource. Currently, ChatGPT remains unable to answer questions at a high enough accuracy to replace clinical decision-making. [<i>Orthopedics</i>. 202x;4x(x):xx-xx.].</p>","PeriodicalId":19631,"journal":{"name":"Orthopedics","volume":" ","pages":"1-6"},"PeriodicalIF":1.1000,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Orthopedics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.3928/01477447-20250123-03","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
引用次数: 0
Abstract
Background: The purpose of this study was to evaluate the performance and evolution of Chat Generative Pre-Trained Transformer (ChatGPT; OpenAI) as a resource for shoulder and elbow surgery information by assessing its accuracy on the American Academy of Orthopaedic Surgeons shoulder-elbow self-assessment questions. We hypothesized that both ChatGPT models would demonstrate proficiency and that there would be significant improvement with progressive iterations.
Materials and methods: A total of 200 questions were selected from the 2019 and 2021 American Academy of Orthopaedic Surgeons shoulder-elbow self-assessment questions. Questions with non-text data were excluded (114 questions), and the remaining 86 questions were input into ChatGPT 3.5 and ChatGPT 4 and categorized as follows: anatomy, arthroplasty, basic science, instability, miscellaneous, nonoperative, and trauma. Each model's performance was quantified and compared across categories with chi-square tests. The continuing medical education credit threshold of 50% was used to define proficiency. Statistical significance was set at P<.05.
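As an illustration of the kind of chi-square comparison described above, the following is a minimal sketch in Python using SciPy. The correct/incorrect counts are back-calculated from the reported accuracies (52.3% and 73.3% of 86 text-only questions) and are illustrative assumptions, not the authors' raw data, so the resulting statistics will only approximate the published values.

```python
# Minimal sketch of a chi-square comparison of two models' accuracies.
# Counts are back-calculated from the reported percentages (assumed, not raw data).
from scipy.stats import chi2_contingency

TOTAL_QUESTIONS = 86  # 200 selected - 114 excluded for non-text data

# Rows: ChatGPT 3.5, ChatGPT 4; columns: correct, incorrect
observed = [
    [45, TOTAL_QUESTIONS - 45],  # ~52.3% correct
    [63, TOTAL_QUESTIONS - 63],  # ~73.3% correct
]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, p = {p_value:.3f}")

# Proficiency check against the 50% continuing medical education credit threshold
for model, correct in (("ChatGPT 3.5", 45), ("ChatGPT 4", 63)):
    accuracy = correct / TOTAL_QUESTIONS
    status = "exceeds" if accuracy > 0.5 else "falls below"
    print(f"{model}: {accuracy:.1%} ({status} the 50% threshold)")
```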
Results: ChatGPT 3.5 and ChatGPT 4 answered 52.3% and 73.3% of the questions correctly, respectively (P=.003). ChatGPT 3.5 performed significantly better in the instability category than in the other categories (P=.037), whereas ChatGPT 4's performance did not differ significantly across categories (P=.841). ChatGPT 4 performed significantly better than ChatGPT 3.5 in all categories except instability and miscellaneous.
Conclusion: ChatGPT 3.5 and ChatGPT 4 both exceeded the proficiency threshold. ChatGPT 4 performed better than ChatGPT 3.5, showing an increased ability to correctly answer shoulder- and elbow-focused questions. Further refinement of ChatGPT's training may improve its performance and utility as a resource. Currently, ChatGPT remains unable to answer questions accurately enough to replace clinical decision-making. [Orthopedics. 202x;4x(x):xx-xx.].
Journal Introduction
For over 40 years, Orthopedics, a bimonthly peer-reviewed journal, has been the preferred choice of orthopedic surgeons for clinically relevant information on all aspects of adult and pediatric orthopedic surgery and treatment. Edited by Robert D'Ambrosia, MD, Chairman of the Department of Orthopedics at the University of Colorado, Denver, and former President of the American Academy of Orthopaedic Surgeons, as well as an Editorial Board of over 100 international orthopedists, Orthopedics is the source to turn to for guidance in your practice.
The journal offers access to current articles, as well as several years of archived content. Highlights also include Blue Ribbon articles published full text in print and online, as well as Tips & Techniques posted with every issue.