ChatGPT-3.5 and -4 provide mostly accurate information when answering patients’ questions relating to femoroacetabular impingement syndrome and arthroscopic hip surgery

Impact factor: 2.7 · JCR Q1 (Orthopedics)

Journal of ISAKOS Joint Disorders & Orthopaedic Sports Medicine · Pub Date: 2025-02-01 · DOI: 10.1016/j.jisako.2024.100376

David Slawaska-Eng, Yoan Bourgeault-Gagnon, Dan Cohen, Thierry Pauyo, Etienne L. Belzile, Olufemi R. Ayeni

Volume 10, Article 100376 · Publication type: Journal Article · Source: https://www.sciencedirect.com/science/article/pii/S2059775424002232

Citations: 0

Abstract

Objectives

This study aimed to evaluate the accuracy of ChatGPT in answering patient questions about femoroacetabular impingement (FAI) and arthroscopic hip surgery, and to compare the performance of ChatGPT-3.5 (free) with that of ChatGPT-4 (paid).

Methods

Twelve frequently asked questions (FAQs) relating to FAI were selected and posed to ChatGPT-3.5 and ChatGPT-4. Three hip arthroscopy surgeons assessed the accuracy of the responses using a four-tier grading system. Statistical analyses included Wilcoxon signed-rank tests and Gwet's AC2 coefficient, a chance-corrected measure of interrater agreement, computed with quadratic weights.
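As an illustration of the agreement statistic named in the Methods, the following is a minimal sketch of Gwet's AC2 with quadratic weights for complete data (every rater scores every item). It is not the authors' code; the function names `quadratic_weights` and `gwet_ac2` and the example ratings are hypothetical, and the confidence intervals reported in the abstract are not reproduced here.

```python
from typing import List, Sequence


def quadratic_weights(q: int) -> List[List[float]]:
    """Quadratic agreement weights: 1 on the diagonal, 0 at maximum distance."""
    return [[1.0 - ((k - l) ** 2) / ((q - 1) ** 2) for l in range(q)]
            for k in range(q)]


def gwet_ac2(ratings: Sequence[Sequence[int]], q: int) -> float:
    """Gwet's AC2 for complete data.

    ratings[i][j] is the category (1..q) that rater j assigned to item i.
    Every item needs at least two ratings.
    """
    w = quadratic_weights(q)
    n = len(ratings)
    # counts[i][k]: number of raters who placed item i in category k + 1
    counts = [[sum(1 for x in row if x == k + 1) for k in range(q)]
              for row in ratings]

    # Weighted observed agreement, averaged over items.
    p_a = 0.0
    for c in counts:
        r_i = sum(c)
        weighted = [sum(w[k][l] * c[l] for l in range(q)) for k in range(q)]
        p_a += sum(c[k] * (weighted[k] - 1.0) for k in range(q)) / (r_i * (r_i - 1))
    p_a /= n

    # Chance agreement from the average category prevalences.
    pi = [sum(c[k] / sum(c) for c in counts) / n for k in range(q)]
    t_w = sum(sum(row) for row in w)
    p_e = t_w / (q * (q - 1)) * sum(p * (1.0 - p) for p in pi)

    return (p_a - p_e) / (1.0 - p_e)


# HYPOTHETICAL ratings (three raters, four items); not the study's data.
example = [[1, 1, 2], [2, 2, 2], [1, 2, 3], [3, 3, 2]]
print(round(gwet_ac2(example, q=4), 3))
```

With quadratic weights, a disagreement of one tier is penalised far less than a disagreement of three tiers, which suits an ordinal four-tier scale.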

Results

The median ratings for responses ranged from “excellent not requiring clarification” to “satisfactory requiring moderate clarification.” No responses were rated as “unsatisfactory requiring substantial clarification.” The median accuracy scores (lower scores indicate higher accuracy) were 2 (range 1–3) for ChatGPT-3.5 and 1.5 (range 1–3) for ChatGPT-4, with 25% of ChatGPT-3.5's responses and 50% of ChatGPT-4's responses rated as “excellent.” There was no statistically significant difference in performance between the two versions (p = 0.279), although ChatGPT-4 showed a tendency towards higher accuracy in some areas. Interrater agreement was substantial for ChatGPT-3.5 (Gwet's AC2 = 0.79 [95% confidence interval (CI) = 0.6–0.94]) and moderate to substantial for ChatGPT-4 (Gwet's AC2 = 0.65 [95% CI = 0.43–0.87]).
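The paired comparison between the two versions can be sketched as a Wilcoxon signed-rank test on per-question scores. The version below is a minimal pure-Python sketch that assumes the normal approximation with a tie correction and no continuity correction; the abstract does not say which variant or software the authors used, so it will not necessarily reproduce p = 0.279. The function name `wilcoxon_signed_rank` and the example scores are hypothetical.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float]:
    """Return (W, two-sided p) using the normal approximation with tie correction.

    Zero differences are discarded; tied absolute differences share
    the average of the ranks they span.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank |d| in ascending order, assigning average ranks to ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    mean = n * (n + 1) / 4.0
    var = n * (n + 1) * (2 * n + 1) / 24.0 - tie_term / 48.0
    z = (w - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal tail probability
    return w, p


# HYPOTHETICAL per-question scores for twelve questions; not the study's data.
scores_v35 = [2, 2, 1, 3, 2, 2, 1, 2, 3, 2, 1, 2]
scores_v4 = [1, 2, 1, 2, 2, 1, 1, 3, 2, 1, 2, 2]
print(wilcoxon_signed_rank(scores_v35, scores_v4))
```

With only twelve paired questions and many tied or zero differences, an exact test may be preferable to the normal approximation, which is in keeping with the low power the authors acknowledge.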

Conclusion

Both versions of ChatGPT provided mostly accurate responses to FAQs on FAI and arthroscopic surgery, with no significant difference between the versions. The findings suggest that ChatGPT has potential utility in patient education, though cautious implementation and further evaluation are recommended because of variability in response accuracy and the study's low statistical power.

Level of evidence

IV.
