Is Readability a Proxy for Reliability? A Qualitative Evaluation of ChatGPT-4.0 in Orthopaedic Trauma Communication
Mehmet Okan Atahan, Çağrı Üner, Mehmet Aydemir, Mehmet Fatih Uzun, Mustafa Yalın, Fatih Gölgelioğlu
Journal of Evaluation in Clinical Practice, 31(5), published 2025-08-14. DOI: 10.1111/jep.70238 (https://onlinelibrary.wiley.com/doi/10.1111/jep.70238)
Citations: 0
Abstract
Aim
This study aimed to evaluate the accuracy, readability, and safety of ChatGPT-4.0's responses to frequently asked questions (FAQs) related to orthopaedic trauma and to examine whether readability is associated with the quality and reliability of content.
Methods
Ten common patient questions related to orthopaedic emergencies were submitted to ChatGPT-4.0. Each response was assessed independently by three orthopaedic trauma surgeons using a 4-point ordinal scale for accuracy, clinical appropriateness, and safety. Readability was calculated using the Flesch-Kincaid Grade Level (FKGL). Inter-rater agreement was analysed using intraclass correlation coefficients (ICC). The presence of disclaimers was also recorded.
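For illustration, the Flesch-Kincaid Grade Level used in the study can be computed from word, sentence, and syllable counts via the standard FKGL formula: 0.39 × (words/sentences) + 11.8 × (syllables/words) − 15.59. The sketch below is a minimal Python implementation with a heuristic syllable counter; the abstract does not specify which tool the authors used, so the function names, the syllable heuristic, and the example response text are illustrative assumptions rather than the study's actual pipeline.

```python
import re

def count_syllables(word: str) -> int:
    # Heuristic: count groups of consecutive vowels; dedicated tools use dictionaries.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def fkgl(text: str) -> float:
    """Flesch-Kincaid Grade Level: 0.39*(words/sentences) + 11.8*(syllables/words) - 15.59."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * (len(words) / len(sentences)) + 11.8 * (syllables / len(words)) - 15.59

# Score one hypothetical chatbot response (not taken from the study).
response = ("A suspected ankle fracture should be immobilised and evaluated promptly. "
            "Seek emergency care if the limb is deformed, numb, or the skin is broken.")
print(round(fkgl(response), 1))
```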
Results
ChatGPT-4.0's responses had a mean FKGL score of 10.5, indicating high school-level readability. Stratified analysis showed comparable readability scores across response quality categories: excellent (10.0), poor (9.8), and dangerous (10.1), suggesting that readability does not predict content reliability. Accuracy and safety scores varied considerably among responses, with the highest inter-rater agreement in clinical appropriateness (ICC = 0.81) and the lowest in safety assessments (ICC = 0.68). Notably, nine out of 10 responses included a disclaimer indicating the nonprofessional nature of the content, with one omission observed in a high-risk clinical scenario.
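As a hedged illustration of how inter-rater agreement values like those reported above could be derived, the sketch below computes a two-way random-effects, single-rater, absolute-agreement ICC (Shrout-Fleiss ICC(2,1)) from a responses-by-raters score matrix. The abstract does not state which ICC model the authors applied, so the model choice and the example ratings are assumptions for demonstration only.

```python
import numpy as np

def icc_2_1(scores: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater.
    `scores` has shape (n_targets, k_raters), e.g. 10 responses x 3 surgeons."""
    n, k = scores.shape
    grand = scores.mean()
    row_means = scores.mean(axis=1)   # per-response means
    col_means = scores.mean(axis=0)   # per-rater means
    ss_rows = k * ((row_means - grand) ** 2).sum()
    ss_cols = n * ((col_means - grand) ** 2).sum()
    ss_total = ((scores - grand) ** 2).sum()
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

# Hypothetical 4-point ratings for 10 responses by 3 raters (not the study's data).
ratings = np.array([[4, 4, 3], [3, 3, 3], [2, 3, 2], [4, 4, 4], [1, 2, 1],
                    [3, 4, 3], [2, 2, 3], [4, 3, 4], [3, 3, 2], [1, 1, 2]])
print(round(icc_2_1(ratings), 2))
```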
Conclusion
Although ChatGPT-4.0 provides generally readable responses to orthopaedic trauma questions, readability does not reliably distinguish between accurate and potentially harmful information. These findings highlight the need for expert review when using AI-generated content in clinical communication.
Journal Introduction:
The Journal of Evaluation in Clinical Practice aims to promote the evaluation and development of clinical practice across medicine, nursing and the allied health professions. All aspects of health services research and public health policy analysis and debate are of interest to the Journal whether studied from a population-based or individual patient-centred perspective. Of particular interest to the Journal are submissions on all aspects of clinical effectiveness and efficiency including evidence-based medicine, clinical practice guidelines, clinical decision making, clinical services organisation, implementation and delivery, health economic evaluation, health process and outcome measurement and new or improved methods (conceptual and statistical) for systematic inquiry into clinical practice. Papers may take a classical quantitative or qualitative approach to investigation (or may utilise both techniques) or may take the form of learned essays, structured/systematic reviews and critiques.