ChatGPT and Gemini Are Not Consistently Concordant With the 2020 American Academy of Orthopaedic Surgeons Clinical Practice Guidelines When Evaluating Rotator Cuff Injury.
Michael Megafu, Omar Guerrero, Avanish Yendluri, Bradford O Parsons, Leesa M Galatz, Xinning Li, John D Kelly, Robert L Parisien
{"title":"ChatGPT and Gemini Are Not Consistently Concordant With the 2020 American Academy of Orthopaedic Surgeons Clinical Practice Guidelines When Evaluating Rotator Cuff Injury.","authors":"Michael Megafu, Omar Guerrero, Avanish Yendluri, Bradford O Parsons, Leesa M Galatz, Xinning Li, John D Kelly, Robert L Parisien","doi":"10.1016/j.arthro.2025.01.039","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>To evaluate the accuracy of suggestions given by ChatGPT and Gemini (previously known as \"Bard\"), 2 widely used publicly available large language models, to evaluate the management of rotator cuff injuries.</p><p><strong>Methods: </strong>The 2020 American Academy of Orthopaedic Surgeons (AAOS) Clinical Practice Guidelines (CPGs) were the basis for determining recommended and non-recommended treatments in this study. ChatGPT and Gemini were queried on 16 treatments based on these guidelines examining rotator cuff interventions. The responses were categorized as \"concordant\" or \"discordant\" with the AAOS CPGs. The Cohen κ coefficient was calculated to assess inter-rater reliability.</p><p><strong>Results: </strong>ChatGPT and Gemini showed concordance with the AAOS CPGs for 13 of the 16 treatments queried (81%) and 12 of the 16 treatments queried (75%), respectively. ChatGPT provided discordant responses with the AAOS CPGs for 3 treatments (19%), whereas Gemini provided discordant responses for 4 treatments (25%). Assessment of inter-rater reliability showed a Cohen κ coefficient of 0.98, signifying agreement between the raters in classifying the responses of ChatGPT and Gemini to the AAOS CPGs as being concordant or discordant.</p><p><strong>Conclusions: </strong>ChatGPT and Gemini do not consistently provide responses that align with the AAOS CPGs.</p><p><strong>Clinical relevance: </strong>This study provides evidence that cautions patients not to rely solely on artificial intelligence for recommendations about rotator cuff injuries.</p>","PeriodicalId":55459,"journal":{"name":"Arthroscopy-The Journal of Arthroscopic and Related Surgery","volume":" ","pages":""},"PeriodicalIF":4.4000,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Arthroscopy-The Journal of Arthroscopic and Related Surgery","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.arthro.2025.01.039","RegionNum":1,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
引用次数: 0
Abstract
Purpose: To evaluate the accuracy of recommendations provided by ChatGPT and Gemini (previously known as "Bard"), 2 widely used, publicly available large language models, regarding the management of rotator cuff injuries.
Methods: The 2020 American Academy of Orthopaedic Surgeons (AAOS) Clinical Practice Guidelines (CPGs) were the basis for determining recommended and non-recommended treatments in this study. ChatGPT and Gemini were queried on 16 rotator cuff treatments addressed by these guidelines. The responses were categorized as "concordant" or "discordant" with the AAOS CPGs. The Cohen κ coefficient was calculated to assess inter-rater reliability.
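As a concrete illustration of the inter-rater reliability step, the minimal sketch below computes the Cohen κ coefficient for two raters' concordant/discordant labels using scikit-learn. The label arrays are hypothetical placeholders for illustration only; the study's actual per-treatment classifications are not reported in this abstract.

```python
# Minimal sketch (not the authors' code): Cohen's kappa for two raters who each
# classify the 16 model responses as "concordant" or "discordant" with the AAOS CPGs.
from sklearn.metrics import cohen_kappa_score

# Hypothetical placeholder labels for the 16 queried treatments.
rater_1 = ["concordant"] * 13 + ["discordant"] * 3
rater_2 = ["concordant"] * 12 + ["discordant"] * 4

# kappa = (observed agreement - chance agreement) / (1 - chance agreement)
kappa = cohen_kappa_score(rater_1, rater_2)
print(f"Cohen's kappa: {kappa:.2f}")  # ~0.82 for these toy labels; the study reports 0.98 on its actual data
```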
Results: ChatGPT and Gemini showed concordance with the AAOS CPGs for 13 of the 16 treatments queried (81%) and 12 of the 16 treatments queried (75%), respectively. ChatGPT provided responses discordant with the AAOS CPGs for 3 treatments (19%), whereas Gemini provided discordant responses for 4 treatments (25%). Assessment of inter-rater reliability yielded a Cohen κ coefficient of 0.98, signifying near-perfect agreement between the raters in classifying the ChatGPT and Gemini responses as concordant or discordant with the AAOS CPGs.
Conclusions: ChatGPT and Gemini do not consistently provide responses that align with the AAOS CPGs.
Clinical relevance: This study provides evidence that cautions patients not to rely solely on artificial intelligence for recommendations about rotator cuff injuries.
Journal Introduction:
Nowhere is minimally invasive surgery explained better than in Arthroscopy, the leading peer-reviewed journal in the field. Every issue enables you to put into perspective the usefulness of the various emerging arthroscopic techniques. The advantages and disadvantages of these methods, along with their applications in various situations, are discussed in relation to their efficiency, efficacy, and cost benefit. As a special incentive, paid subscribers also receive access to the journal's expanded website.