ChatGPT-Generated Responses Across Orthopaedic Sports Medicine Surgery Vary in Accuracy, Quality, and Readability: A Systematic Review

Jacob D. Kodra, B.S., Arthur Saroyan, B.S., Fabrizio Darby, B.S., Serkan Surucu, M.D., Scott Fong, B.A., Stephen Gillinov, B.A., Kevin Girardi, B.A., Rajiv Vasudevan, M.D., Jeremy K. Ansah-Twum, M.D., Louise Atadja, M.D., Jay Moran, M.D., Andrew E. Jimenez, M.D.
{"title":"ChatGPT-Generated Responses Across Orthopaedic Sports Medicine Surgery Vary in Accuracy, Quality, and Readability: A Systematic Review","authors":"Jacob D. Kodra B.S. ,&nbsp;Arthur Saroyan B.S. ,&nbsp;Fabrizio Darby B.S. ,&nbsp;Serkan Surucu M.D. ,&nbsp;Scott Fong B.A. ,&nbsp;Stephen Gillinov B.A. ,&nbsp;Kevin Girardi B.A. ,&nbsp;Rajiv Vasudevan M.D. ,&nbsp;Jeremy K. Ansah-Twum M.D. ,&nbsp;Louise Atadja M.D. ,&nbsp;Jay Moran M.D. ,&nbsp;Andrew E. Jimenez M.D.","doi":"10.1016/j.asmr.2025.101210","DOIUrl":null,"url":null,"abstract":"<div><h3>Purpose</h3><div>To evaluate the current literature regarding the accuracy and efficacy of ChatGPT in delivering patient education on common orthopaedic sports medicine operations.</div></div><div><h3>Methods</h3><div>A systematic review was performed in accordance with Preferred Reporting Items for Systematic Reviews and Meta-analyses guidelines. After PROSPERO registration, a keyword search was conducted in the PubMed, Cochrane Central Register of Controlled Trials, and Scopus databases in September 2024. Articles were included if they evaluated ChatGPT’s performance against established sources, examined ChatGPT’s ability to provide counseling related to orthopaedic sports medicine operations, and assessed ChatGPT’s quality of responses. Primary outcomes assessed were quality of written content (e.g., DISCERN score), readability (e.g., Flesch-Kincaid Grade Level and Flesch-Kincaid Reading Ease Score), and reliability (<em>Journal of the American Medical Association</em> Benchmark Criteria).</div></div><div><h3>Results</h3><div>Seventeen articles satisfied the inclusion and exclusion criteria and formed the basis of this review. Four studies compared the effectiveness of ChatGPT and Google, and another study compared ChatGPT-3.5 with ChatGPT-4. ChatGPT provided moderate- to high-quality responses (mean DISCERN score, 41.0-62.1), with strong inter-rater reliability (0.72-0.91). Readability analyses showed that responses were written at a high school to college reading level (mean Flesch-Kincaid Grade Level, 10.3-16.0) and were generally difficult to read (mean Flesch-Kincaid Reading Ease Score, 28.1-48.0). ChatGPT frequently lacked source citations, resulting in a poor reliability score across all studies (mean <em>Journal of the American Medical Association</em> score, 0). Compared with Google, ChatGPT-4 generally provided higher-quality responses. ChatGPT also displayed limited source transparency unless specifically prompted for sources. ChatGPT-4 outperformed ChatGPT-3.5 in response quality (DISCERN score, 3.86 [95% confidence interval, 3.79-3.93] vs 3.46 [95% confidence interval, 3.40-3.54]; <em>P</em> = .01) and readability.</div></div><div><h3>Conclusions</h3><div>ChatGPT provides generally satisfactory responses to patient questions regarding orthopaedic sports medicine operations. 
However, its utility remains limited by challenges with source attribution, high reading complexity, and variability in accuracy.</div></div><div><h3>Level of Evidence</h3><div>Level V, systematic review of Level V studies.</div></div>","PeriodicalId":34631,"journal":{"name":"Arthroscopy Sports Medicine and Rehabilitation","volume":"7 4","pages":"Article 101210"},"PeriodicalIF":0.0000,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Arthroscopy Sports Medicine and Rehabilitation","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666061X25001361","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Medicine","Score":null,"Total":0}
引用次数: 0

Abstract

Purpose

To evaluate the current literature regarding the accuracy and efficacy of ChatGPT in delivering patient education on common orthopaedic sports medicine operations.

Methods

A systematic review was performed in accordance with Preferred Reporting Items for Systematic Reviews and Meta-analyses guidelines. After PROSPERO registration, a keyword search was conducted in the PubMed, Cochrane Central Register of Controlled Trials, and Scopus databases in September 2024. Articles were included if they evaluated ChatGPT’s performance against established sources, examined ChatGPT’s ability to provide counseling related to orthopaedic sports medicine operations, and assessed ChatGPT’s quality of responses. Primary outcomes assessed were quality of written content (e.g., DISCERN score), readability (e.g., Flesch-Kincaid Grade Level and Flesch-Kincaid Reading Ease Score), and reliability (Journal of the American Medical Association Benchmark Criteria).
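For context, both readability indices cited above are computed from simple text statistics. The following Python sketch is illustrative only: it applies the standard Flesch-Kincaid formulas, but the heuristic syllable counter is an assumption of this sketch, not the validated tooling used by the included studies.

# Minimal, illustrative sketch of the two readability indices used in this
# review. Formulas are the standard Flesch-Kincaid definitions; the syllable
# counter is a rough heuristic, not a validated implementation.
import re

def count_syllables(word: str) -> int:
    """Approximate syllables as vowel groups, with a silent-e correction."""
    word = word.lower()
    n = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (Flesch-Kincaid Grade Level, Flesch Reading Ease Score)."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)
    syllables = sum(count_syllables(w) for w in words)
    wps = n_words / sentences    # mean words per sentence
    spw = syllables / n_words    # mean syllables per word
    grade = 0.39 * wps + 11.8 * spw - 15.59
    ease = 206.835 - 1.015 * wps - 84.6 * spw
    return grade, ease

# A Reading Ease Score of 28.1-48.0, as reported in this review, falls in the
# "difficult" (roughly college-level) band of the Flesch scale.
grade, ease = readability(
    "Anterior cruciate ligament reconstruction restores knee stability "
    "by replacing the torn ligament with a tendon graft."
)
print(f"Grade level: {grade:.1f}, Reading ease: {ease:.1f}")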

Results

Seventeen articles satisfied the inclusion and exclusion criteria and formed the basis of this review. Four studies compared the effectiveness of ChatGPT and Google, and one study compared ChatGPT-3.5 with ChatGPT-4. ChatGPT provided moderate- to high-quality responses (mean total DISCERN score, 41.0-62.1), with strong inter-rater reliability (0.72-0.91). Readability analyses showed that responses were written at a high school to college reading level (mean Flesch-Kincaid Grade Level, 10.3-16.0) and were generally difficult to read (mean Flesch-Kincaid Reading Ease Score, 28.1-48.0). ChatGPT frequently lacked source citations and displayed limited source transparency unless specifically prompted, resulting in a poor reliability score across all studies (mean Journal of the American Medical Association score, 0 of a possible 4). Compared with Google, ChatGPT-4 generally provided higher-quality responses. ChatGPT-4 also outperformed ChatGPT-3.5 in both response quality (mean per-question DISCERN score, 3.86 [95% confidence interval, 3.79-3.93] vs 3.46 [95% confidence interval, 3.40-3.54]; P = .01) and readability.

Conclusions

ChatGPT provides generally satisfactory responses to patient questions regarding orthopaedic sports medicine operations. However, its utility remains limited by challenges with source attribution, high reading complexity, and variability in accuracy.

Level of Evidence

Level V, systematic review of Level V studies.