Is Readability a Proxy for Reliability? A Qualitative Evaluation of ChatGPT-4.0 in Orthopaedic Trauma Communication
Mehmet Okan Atahan, Çağrı Üner, Mehmet Aydemir, Mehmet Fatih Uzun, Mustafa Yalın, Fatih Gölgelioğlu
Journal of Evaluation in Clinical Practice, 31(5), published 2025-08-14. DOI: 10.1111/jep.70238 (https://onlinelibrary.wiley.com/doi/10.1111/jep.70238)
Citations: 0
Abstract
Aim
This study aimed to evaluate the accuracy, readability, and safety of ChatGPT-4.0's responses to frequently asked questions (FAQs) related to orthopaedic trauma and to examine whether readability is associated with the quality and reliability of content.
Methods
Ten common patient questions related to orthopaedic emergencies were submitted to ChatGPT-4.0. Each response was assessed independently by three orthopaedic trauma surgeons using a 4-point ordinal scale for accuracy, clinical appropriateness, and safety. Readability was calculated using the Flesch-Kincaid Grade Level (FKGL). Inter-rater agreement was analysed using intraclass correlation coefficients (ICC). The presence of disclaimers was also recorded.
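For illustration, the Flesch-Kincaid Grade Level used in the study can be computed from word, sentence, and syllable counts via the standard FKGL formula: 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59. The sketch below is a minimal Python implementation with a heuristic syllable counter; the abstract does not specify which tool the authors used, so the function names, the syllable heuristic, and the example response text are illustrative assumptions rather than the study's actual pipeline.

```python
import re

def count_syllables(word: str) -> int:
    # Heuristic: count groups of consecutive vowels; dedicated tools use dictionaries.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

# Score one hypothetical chatbot response (not taken from the study).
response = ("A suspected ankle fracture should be immobilised and evaluated promptly. "
            "Seek emergency care if the limb is deformed, numb, or the skin is broken.")
print(round(fkgl(response), 1))
```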
Results
ChatGPT-4.0's responses had a mean FKGL score of 10.5, indicating high school-level readability. Stratified analysis showed comparable readability scores across response quality categories: excellent (10.0), poor (9.8), and dangerous (10.1), suggesting that readability does not predict content reliability. Accuracy and safety scores varied considerably among responses, with the highest inter-rater agreement in clinical appropriateness (ICC = 0.81) and the lowest in safety assessments (ICC = 0.68). Notably, nine out of 10 responses included a disclaimer indicating the nonprofessional nature of the content, with one omission observed in a high-risk clinical scenario.
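As a hedged illustration of how inter-rater agreement values like those reported above could be derived, the sketch below computes a two-way random-effects, single-rater, absolute-agreement ICC (Shrout-Fleiss ICC(2,1)) from a responses-by-raters score matrix. The abstract does not state which ICC model the authors applied, so the model choice and the example ratings are assumptions for demonstration only.

```python
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `scores` has shape (n_targets, k_raters), e.g. 10 responses x 3 surgeons."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-response means
    col_means = scores.mean(axis=0)   # per-rater means
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_total = ((scores - grand) ** 2).sum()
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical 4-point ratings for 10 responses by 3 raters (not the study's data).
ratings = np.array([[4, 4, 3], [3, 3, 3], [2, 3, 2], [4, 4, 4], [1, 2, 1],
                    [3, 4, 3], [2, 2, 3], [4, 3, 4], [3, 3, 2], [1, 1, 2]])
print(round(icc_2_1(ratings), 2))
```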
Conclusion
Although ChatGPT-4.0 provides generally readable responses to orthopaedic trauma questions, readability does not reliably distinguish between accurate and potentially harmful information. These findings highlight the need for expert review when using AI-generated content in clinical communication.
Journal Introduction:
The Journal of Evaluation in Clinical Practice aims to promote the evaluation and development of clinical practice across medicine, nursing and the allied health professions. All aspects of health services research and public health policy analysis and debate are of interest to the Journal whether studied from a population-based or individual patient-centred perspective. Of particular interest to the Journal are submissions on all aspects of clinical effectiveness and efficiency including evidence-based medicine, clinical practice guidelines, clinical decision making, clinical services organisation, implementation and delivery, health economic evaluation, health process and outcome measurement and new or improved methods (conceptual and statistical) for systematic inquiry into clinical practice. Papers may take a classical quantitative or qualitative approach to investigation (or may utilise both techniques) or may take the form of learned essays, structured/systematic reviews and critiques.