ChatGPT-Generated Responses Across Orthopaedic Sports Medicine Surgery Vary in Accuracy, Quality, and Readability: A Systematic Review

Jacob D. Kodra B.S., Arthur Saroyan B.S., Fabrizio Darby B.S., Serkan Surucu M.D., Scott Fong B.A., Stephen Gillinov B.A., Kevin Girardi B.A., Rajiv Vasudevan M.D., Jeremy K. Ansah-Twum M.D., Louise Atadja M.D., Jay Moran M.D., Andrew E. Jimenez M.D.

Arthroscopy Sports Medicine and Rehabilitation, Volume 7, Issue 4, Article 101210, August 2025. DOI: 10.1016/j.asmr.2025.101210
Abstract
Purpose
To evaluate the current literature regarding the accuracy and efficacy of ChatGPT in delivering patient education on common orthopaedic sports medicine operations.
Methods
A systematic review was performed in accordance with Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines. After PROSPERO registration, a keyword search was conducted in the PubMed, Cochrane Central Register of Controlled Trials, and Scopus databases in September 2024. Articles were included if they evaluated ChatGPT’s performance against established sources, examined ChatGPT’s ability to provide counseling related to orthopaedic sports medicine operations, and assessed ChatGPT’s quality of responses. Primary outcomes assessed were quality of written content (e.g., DISCERN score), readability (e.g., Flesch-Kincaid Grade Level and Flesch-Kincaid Reading Ease Score), and reliability (Journal of the American Medical Association Benchmark Criteria).
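For context, the two readability outcomes above are computed from word, sentence, and syllable counts. The short Python sketch below shows only the standard published Flesch-Kincaid formulas applied to a hypothetical passage; it is not the scoring tool or text-processing pipeline used by the included studies.

def flesch_kincaid(total_words, total_sentences, total_syllables):
    # Standard Flesch-Kincaid formulas; inputs are raw counts for a passage.
    words_per_sentence = total_words / total_sentences
    syllables_per_word = total_syllables / total_words
    # Reading Ease: roughly 0-100 scale, lower = harder (below ~50 reads as "difficult")
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Grade Level: approximate U.S. school grade needed to follow the text
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level

# Hypothetical dense passage: 100 words, 4 sentences, 190 syllables
ease, grade = flesch_kincaid(100, 4, 190)
print(f"Reading Ease = {ease:.1f}, Grade Level = {grade:.1f}")  # about 20.7 and 16.6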
Results
Seventeen articles satisfied the inclusion and exclusion criteria and formed the basis of this review. Four studies compared the effectiveness of ChatGPT and Google, and another study compared ChatGPT-3.5 with ChatGPT-4. ChatGPT provided moderate- to high-quality responses (mean DISCERN score, 41.0-62.1), with strong inter-rater reliability (0.72-0.91). Readability analyses showed that responses were written at a high school to college reading level (mean Flesch-Kincaid Grade Level, 10.3-16.0) and were generally difficult to read (mean Flesch-Kincaid Reading Ease Score, 28.1-48.0). ChatGPT frequently lacked source citations, resulting in a poor reliability score across all studies (mean Journal of the American Medical Association score, 0). Compared with Google, ChatGPT-4 generally provided higher-quality responses. ChatGPT also displayed limited source transparency unless specifically prompted for sources. ChatGPT-4 outperformed ChatGPT-3.5 in response quality (DISCERN score, 3.86 [95% confidence interval, 3.79-3.93] vs 3.46 [95% confidence interval, 3.40-3.54]; P = .01) and readability.
Conclusions
ChatGPT provides generally satisfactory responses to patient questions regarding orthopaedic sports medicine operations. However, its utility remains limited by challenges with source attribution, high reading complexity, and variability in accuracy.
Level of Evidence
Level V, systematic review of Level V studies.