{"title":"Evaluation of the reliability, usefulness, quality and readability of ChatGPT's responses on Scoliosis.","authors":"Ayşe Merve Çıracıoğlu, Suheyla Dal Erdoğan","doi":"10.1007/s00590-025-04198-4","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>This study evaluates the reliability, usefulness, quality, and readability of ChatGPT's responses to frequently asked questions about scoliosis.</p><p><strong>Methods: </strong>Sixteen frequently asked questions, identified through an analysis of Google Trends data and clinical feedback, were presented to ChatGPT for evaluation. Two independent experts assessed the responses using a 7-point Likert scale for reliability and usefulness. Additionally, the overall quality was also rated using the Global Quality Scale (GQS). To assess readability, various established metrics were employed, including the Flesch Reading Ease score (FRE), the Simple Measure of Gobbledygook (SMOG) Index, the Coleman-Liau Index (CLI), the Gunning Fog Index (GFI), the Flesch-Kinkaid Grade Level (FKGL), the FORCAST Grade Level, and the Automated Readability Index (ARI).</p><p><strong>Results: </strong>The mean reliability scores were 4.68 ± 0.73 (Median: 5, IQR 4-5), while the mean usefulness scores were 4.84 ± 0.84 (Median: 5, IQR 4-5). Additionally the mean GQS scores were 4.28 ± 0.58 (Median: 4, IQR 4-5). Inter-rater reliability analysis using the Intraclass correlation coefficient showed excellent agreement: 0.942 for reliability, 0.935 for usefulness, and 0.868 for GQS. While general informational questions received high scores, responses to treatment-specific and personalized inquiries required greater depth and comprehensiveness. Readability analysis indicated that ChatGPT's responses required at least a high school senior to college-level reading ability.</p><p><strong>Conclusion: </strong>ChatGPT provides reliable, useful, and moderate quality information on scoliosis but has limitations in addressing treatment-specific and personalized inquiries. Caution is essential when using Artificial Intelligence (AI) in patient education and medical decision-making.</p>","PeriodicalId":50484,"journal":{"name":"European Journal of Orthopaedic Surgery and Traumatology","volume":"35 1","pages":"123"},"PeriodicalIF":1.4000,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"European Journal of Orthopaedic Surgery and Traumatology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s00590-025-04198-4","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
Abstract
Objective: This study evaluates the reliability, usefulness, quality, and readability of ChatGPT's responses to frequently asked questions about scoliosis.
Methods: Sixteen frequently asked questions, identified through an analysis of Google Trends data and clinical feedback, were presented to ChatGPT, and its responses were evaluated. Two independent experts assessed the responses for reliability and usefulness using a 7-point Likert scale, and rated overall quality using the Global Quality Scale (GQS). To assess readability, several established metrics were employed: the Flesch Reading Ease score (FRE), the Simple Measure of Gobbledygook (SMOG) Index, the Coleman-Liau Index (CLI), the Gunning Fog Index (GFI), the Flesch-Kincaid Grade Level (FKGL), the FORCAST Grade Level, and the Automated Readability Index (ARI).
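The abstract does not name the analysis tooling. As a minimal sketch, the seven metrics can be computed with the open-source Python textstat package; textstat does not provide FORCAST, so it is derived by hand from the standard formula, and the helper below is illustrative, not the authors' code.

```python
# Minimal sketch of the readability analysis, assuming the open-source
# "textstat" package; illustrative only, not the authors' pipeline.
import textstat

def readability_profile(text: str) -> dict:
    """Compute the seven readability metrics reported in the study."""
    words = text.split()
    # FORCAST is not built into textstat. Standard formula:
    # grade = 20 - N/10, where N is the count of single-syllable
    # words per 150-word sample (scaled here to the text length).
    mono = sum(1 for w in words if textstat.syllable_count(w) == 1)
    forcast = 20 - (mono * 150 / len(words)) / 10
    return {
        "FRE": textstat.flesch_reading_ease(text),
        "SMOG": textstat.smog_index(text),
        "CLI": textstat.coleman_liau_index(text),
        "GFI": textstat.gunning_fog(text),
        "FKGL": textstat.flesch_kincaid_grade(text),
        "FORCAST": round(forcast, 2),
        "ARI": textstat.automated_readability_index(text),
    }

# Example: profile one (placeholder) ChatGPT response.
print(readability_profile(
    "Scoliosis is a lateral curvature of the spine that is most often "
    "diagnosed in adolescents and monitored with standing radiographs."
))
```

Note that grade-level indices (SMOG, FKGL, GFI, ARI, CLI, FORCAST) map roughly onto US school grades, which is how a finding such as "high school senior to college-level reading ability" is interpreted.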
Results: The mean reliability score was 4.68 ± 0.73 (median 5, IQR 4-5), and the mean usefulness score was 4.84 ± 0.84 (median 5, IQR 4-5). Additionally, the mean GQS score was 4.28 ± 0.58 (median 4, IQR 4-5). Inter-rater reliability analysis using the intraclass correlation coefficient (ICC) showed excellent agreement: 0.942 for reliability, 0.935 for usefulness, and 0.868 for GQS. While general informational questions received high scores, responses to treatment-specific and personalized inquiries would have benefited from greater depth and comprehensiveness. Readability analysis indicated that ChatGPT's responses required at least a high-school-senior to college-level reading ability.
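For the agreement analysis, a hedged sketch using the open-source pingouin package is shown below; the abstract does not specify the statistical software or which ICC model (e.g. two-way random, absolute agreement) was used, and the scores here are illustrative placeholders, not study data.

```python
# Hedged sketch of the inter-rater agreement analysis, assuming the
# open-source "pingouin" package; placeholder data, not study results.
import pandas as pd
import pingouin as pg

# Long format: one row per (question, rater) pair; the study used
# 16 questions rated by two independent experts.
ratings = pd.DataFrame({
    "question":    [1, 1, 2, 2, 3, 3, 4, 4],
    "rater":       ["A", "B"] * 4,
    "reliability": [5, 5, 4, 5, 4, 4, 5, 4],
})

# intraclass_corr returns all six standard ICC forms; the study reports
# a single coefficient per dimension (e.g. 0.942 for reliability).
icc = pg.intraclass_corr(data=ratings, targets="question",
                         raters="rater", ratings="reliability")
print(icc[["Type", "ICC", "CI95%"]])
```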
Conclusion: ChatGPT provides reliable, useful, and moderate-quality information on scoliosis but has limitations in addressing treatment-specific and personalized inquiries. Caution is essential when using Artificial Intelligence (AI) in patient education and medical decision-making.
Journal overview
The European Journal of Orthopaedic Surgery and Traumatology (EJOST) aims to publish high-quality orthopaedic scientific work. The objective of the journal is to disseminate meaningful, impactful, clinically relevant work from every region of the world that has the potential to change and/or inform clinical practice.