Evaluating if ChatGPT Can Answer Common Patient Questions Compared With OrthoInfo Regarding Rotator Cuff Tears

Alexander Jurayj, Julio Nerys-Figueroa, Emil Espinal, Michael A Gaudiani, Travis Baes, Jared Mahylis, Stephanie Muh

Journal of the American Academy of Orthopaedic Surgeons Global Research and Reviews, 9(3), published 2025-03-11. DOI: 10.5435/JAAOSGlobal-D-24-00289
Abstract
Purpose: To evaluate ChatGPT's (OpenAI) ability to provide accurate, appropriate, and readable responses to common patient questions about rotator cuff tears.
Methods: Eight questions from the OrthoInfo rotator cuff tear web page were input into ChatGPT at two reading levels: standard and sixth grade. Five orthopaedic surgeons rated the accuracy and appropriateness of each response on a Likert scale, and readability was measured with the Flesch-Kincaid Grade Level. Results were analyzed with paired Student t-tests.
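For readers unfamiliar with the metrics, the following is a minimal sketch, not taken from the study, of how a Flesch-Kincaid Grade Level and a paired t-test could be computed in Python. The syllable counter is a crude vowel-group heuristic, and the Likert scores are hypothetical placeholders, not the authors' data; only the FKGL formula itself and scipy's `ttest_rel` are standard.

```python
# Sketch: Flesch-Kincaid Grade Level + paired t-test (illustrative only).
import re
from scipy import stats

def count_syllables(word: str) -> int:
    """Approximate syllables as runs of vowels (rough heuristic)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade_level(text: str) -> float:
    """FKGL = 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / sentences) + 11.8 * (syllables / len(words)) - 15.59

sample = "The rotator cuff is a group of muscles and tendons that stabilize the shoulder."
print(f"FKGL of sample text: {fk_grade_level(sample):.1f}")

# Hypothetical paired Likert accuracy ratings for the same eight questions,
# one pair per question (standard vs. sixth-grade response).
standard_scores = [5, 5, 4, 5, 4, 5, 5, 4]
sixth_grade_scores = [4, 3, 4, 4, 3, 4, 3, 4]
t_stat, p_value = stats.ttest_rel(standard_scores, sixth_grade_scores)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```

A paired test is the right choice here because the two conditions (standard and sixth-grade prompts) answer the same eight questions, so each pair of scores shares a question-level baseline.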
Results: Standard ChatGPT responses scored higher in accuracy (4.7 ± 0.47 vs. 3.6 ± 0.76; P < 0.001) and appropriateness (4.5 ± 0.57 vs. 3.7 ± 0.98; P < 0.001) compared with sixth-grade responses. However, standard ChatGPT responses were less accurate (4.7 ± 0.47 vs. 5.0 ± 0.0; P = 0.004) and appropriate (4.5 ± 0.57 vs. 5.0 ± 0.0; P = 0.016) when compared with OrthoInfo responses. OrthoInfo responses were also notably better than sixth-grade responses in both accuracy and appropriateness (P < 0.001). Standard responses had a higher Flesch-Kincaid grade level compared with both OrthoInfo and sixth-grade responses (P < 0.001).
Conclusion: Standard ChatGPT responses were less accurate and appropriate than OrthoInfo responses, with worse readability. Although easier to read, sixth-grade level ChatGPT responses sacrificed accuracy and appropriateness. At this time, ChatGPT is not recommended as a standalone source for patient information on rotator cuff tears but may supplement information provided by orthopaedic surgeons.