ChatGPT-Generated Responses Across Orthopaedic Sports Medicine Surgery Vary in Accuracy, Quality, and Readability: A Systematic Review

Jacob D. Kodra, B.S., Arthur Saroyan, B.S., Fabrizio Darby, B.S., Serkan Surucu, M.D., Scott Fong, B.A., Stephen Gillinov, B.A., Kevin Girardi, B.A., Rajiv Vasudevan, M.D., Jeremy K. Ansah-Twum, M.D., Louise Atadja, M.D., Jay Moran, M.D., Andrew E. Jimenez, M.D.
{"title":"ChatGPT-Generated Responses Across Orthopaedic Sports Medicine Surgery Vary in Accuracy, Quality, and Readability: A Systematic Review","authors":"Jacob D. Kodra B.S. ,&nbsp;Arthur Saroyan B.S. ,&nbsp;Fabrizio Darby B.S. ,&nbsp;Serkan Surucu M.D. ,&nbsp;Scott Fong B.A. ,&nbsp;Stephen Gillinov B.A. ,&nbsp;Kevin Girardi B.A. ,&nbsp;Rajiv Vasudevan M.D. ,&nbsp;Jeremy K. Ansah-Twum M.D. ,&nbsp;Louise Atadja M.D. ,&nbsp;Jay Moran M.D. ,&nbsp;Andrew E. Jimenez M.D.","doi":"10.1016/j.asmr.2025.101210","DOIUrl":null,"url":null,"abstract":"<div><h3>Purpose</h3><div>To evaluate the current literature regarding the accuracy and efficacy of ChatGPT in delivering patient education on common orthopaedic sports medicine operations.</div></div><div><h3>Methods</h3><div>A systematic review was performed in accordance with Preferred Reporting Items for Systematic Reviews and Meta-analyses guidelines. After PROSPERO registration, a keyword search was conducted in the PubMed, Cochrane Central Register of Controlled Trials, and Scopus databases in September 2024. Articles were included if they evaluated ChatGPT’s performance against established sources, examined ChatGPT’s ability to provide counseling related to orthopaedic sports medicine operations, and assessed ChatGPT’s quality of responses. Primary outcomes assessed were quality of written content (e.g., DISCERN score), readability (e.g., Flesch-Kincaid Grade Level and Flesch-Kincaid Reading Ease Score), and reliability (<em>Journal of the American Medical Association</em> Benchmark Criteria).</div></div><div><h3>Results</h3><div>Seventeen articles satisfied the inclusion and exclusion criteria and formed the basis of this review. Four studies compared the effectiveness of ChatGPT and Google, and another study compared ChatGPT-3.5 with ChatGPT-4. ChatGPT provided moderate- to high-quality responses (mean DISCERN score, 41.0-62.1), with strong inter-rater reliability (0.72-0.91). Readability analyses showed that responses were written at a high school to college reading level (mean Flesch-Kincaid Grade Level, 10.3-16.0) and were generally difficult to read (mean Flesch-Kincaid Reading Ease Score, 28.1-48.0). ChatGPT frequently lacked source citations, resulting in a poor reliability score across all studies (mean <em>Journal of the American Medical Association</em> score, 0). Compared with Google, ChatGPT-4 generally provided higher-quality responses. ChatGPT also displayed limited source transparency unless specifically prompted for sources. ChatGPT-4 outperformed ChatGPT-3.5 in response quality (DISCERN score, 3.86 [95% confidence interval, 3.79-3.93] vs 3.46 [95% confidence interval, 3.40-3.54]; <em>P</em> = .01) and readability.</div></div><div><h3>Conclusions</h3><div>ChatGPT provides generally satisfactory responses to patient questions regarding orthopaedic sports medicine operations. 
However, its utility remains limited by challenges with source attribution, high reading complexity, and variability in accuracy.</div></div><div><h3>Level of Evidence</h3><div>Level V, systematic review of Level V studies.</div></div>","PeriodicalId":34631,"journal":{"name":"Arthroscopy Sports Medicine and Rehabilitation","volume":"7 4","pages":"Article 101210"},"PeriodicalIF":0.0000,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Arthroscopy Sports Medicine and Rehabilitation","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666061X25001361","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Medicine","Score":null,"Total":0}
引用次数: 0

Abstract

Purpose

To evaluate the current literature regarding the accuracy and efficacy of ChatGPT in delivering patient education on common orthopaedic sports medicine operations.

Methods

A systematic review was performed in accordance with Preferred Reporting Items for Systematic Reviews and Meta-analyses guidelines. After PROSPERO registration, a keyword search was conducted in the PubMed, Cochrane Central Register of Controlled Trials, and Scopus databases in September 2024. Articles were included if they evaluated ChatGPT’s performance against established sources, examined ChatGPT’s ability to provide counseling related to orthopaedic sports medicine operations, and assessed ChatGPT’s quality of responses. Primary outcomes assessed were quality of written content (e.g., DISCERN score), readability (e.g., Flesch-Kincaid Grade Level and Flesch-Kincaid Reading Ease Score), and reliability (Journal of the American Medical Association Benchmark Criteria).
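For context, both readability indices cited above are computed from simple text statistics. The following Python sketch is illustrative only: it applies the standard Flesch-Kincaid formulas, but the heuristic syllable counter is an assumption of this sketch, not the validated tooling used by the included studies.

# Minimal, illustrative sketch of the two readability indices used in this
# review. Formulas are the standard Flesch-Kincaid definitions; the syllable
# counter is a rough heuristic, not a validated implementation.
import re

def count_syllables(word: str) -> int:
    """Approximate syllables as vowel groups, with a silent-e correction."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (Flesch-Kincaid Grade Level, Flesch Reading Ease Score)."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)
    syllables = sum(count_syllables(w) for w in words)
    wps = n_words / sentences    # mean words per sentence
    spw = syllables / n_words    # mean syllables per word
    grade = 0.39 * wps + 11.8 * spw - 15.59
    ease = 206.835 - 1.015 * wps - 84.6 * spw
    return grade, ease

# A Reading Ease Score of 28.1-48.0, as reported in this review, falls in the
# "difficult" (roughly college-level) band of the Flesch scale.
grade, ease = readability(
    "Anterior cruciate ligament reconstruction restores knee stability "
    "by replacing the torn ligament with a tendon graft."
)
print(f"Grade level: {grade:.1f}, Reading ease: {ease:.1f}")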

Results

Seventeen articles satisfied the inclusion and exclusion criteria and formed the basis of this review. Four studies compared the effectiveness of ChatGPT and Google, and one study compared ChatGPT-3.5 with ChatGPT-4. ChatGPT provided moderate- to high-quality responses (mean total DISCERN score, 41.0-62.1), with strong inter-rater reliability (0.72-0.91). Readability analyses showed that responses were written at a high school to college reading level (mean Flesch-Kincaid Grade Level, 10.3-16.0) and were generally difficult to read (mean Flesch-Kincaid Reading Ease Score, 28.1-48.0). ChatGPT frequently lacked source citations and displayed limited source transparency unless specifically prompted, resulting in a poor reliability score across all studies (mean Journal of the American Medical Association score, 0 of a possible 4). Compared with Google, ChatGPT-4 generally provided higher-quality responses. ChatGPT-4 also outperformed ChatGPT-3.5 in both response quality (mean per-question DISCERN score, 3.86 [95% confidence interval, 3.79-3.93] vs 3.46 [95% confidence interval, 3.40-3.54]; P = .01) and readability.

Conclusions

ChatGPT provides generally satisfactory responses to patient questions regarding orthopaedic sports medicine operations. However, its utility remains limited by challenges with source attribution, high reading complexity, and variability in accuracy.

Level of Evidence

Level V, systematic review of Level V studies.