Is Readability a Proxy for Reliability? A Qualitative Evaluation of ChatGPT-4.0 in Orthopaedic Trauma Communication

IF 2.1 | Q3, Health Care Sciences & Services | Medicine (CAS Tier 4)
Mehmet Okan Atahan, Çağrı Üner, Mehmet Aydemir, Mehmet Fatih Uzun, Mustafa Yalın, Fatih Gölgelioğlu
{"title":"可读性是可靠性的代表吗?ChatGPT-4.0在骨科创伤沟通中的定性评价","authors":"Mehmet Okan Atahan,&nbsp;Çağrı Üner,&nbsp;Mehmet Aydemir,&nbsp;Mehmet Fatih Uzun,&nbsp;Mustafa Yalın,&nbsp;Fatih Gölgelioğlu","doi":"10.1111/jep.70238","DOIUrl":null,"url":null,"abstract":"<div>\n \n \n <section>\n \n <h3> Aim</h3>\n \n <p>This study aimed to evaluate the accuracy, readability, and safety of ChatGPT-4.0's responses to frequently asked questions (FAQs) related to orthopaedic trauma and to examine whether readability is associated with the quality and reliability of content.</p>\n </section>\n \n <section>\n \n <h3> Methods</h3>\n \n <p>Ten common patient questions related to orthopaedic emergencies were submitted to ChatGPT-4.0. Each response was assessed independently by three orthopaedic trauma surgeons using a 4-point ordinal scale for accuracy, clinical appropriateness, and safety. Readability was calculated using the Flesch-Kincaid Grade Level (FKGL). Inter-rater agreement was analysed using intraclass correlation coefficients (ICC). The presence of disclaimers was also recorded.</p>\n </section>\n \n <section>\n \n <h3> Results</h3>\n \n <p>ChatGPT-4.0's responses had a mean FKGL score of 10.5, indicating high school-level readability. Stratified analysis showed comparable readability scores across response quality categories: excellent (10.0), poor (9.8), and dangerous (10.1), suggesting that readability does not predict content reliability. Accuracy and safety scores varied considerably among responses, with the highest inter-rater agreement in clinical appropriateness (ICC = 0.81) and the lowest in safety assessments (ICC = 0.68). Notably, nine out of 10 responses included a disclaimer indicating the nonprofessional nature of the content, with one omission observed in a high-risk clinical scenario.</p>\n </section>\n \n <section>\n \n <h3> Conclusion</h3>\n \n <p>Although ChatGPT-4.0 provides generally readable responses to orthopaedic trauma questions, readability does not reliably distinguish between accurate and potentially harmful information. These findings highlight the need for expert review when using AI-generated content in clinical communication.</p>\n </section>\n </div>","PeriodicalId":15997,"journal":{"name":"Journal of evaluation in clinical practice","volume":"31 5","pages":""},"PeriodicalIF":2.1000,"publicationDate":"2025-08-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Is Readability a Proxy for Reliability? A Qualitative Evaluation of ChatGPT-4.0 in Orthopaedic Trauma Communication\",\"authors\":\"Mehmet Okan Atahan,&nbsp;Çağrı Üner,&nbsp;Mehmet Aydemir,&nbsp;Mehmet Fatih Uzun,&nbsp;Mustafa Yalın,&nbsp;Fatih Gölgelioğlu\",\"doi\":\"10.1111/jep.70238\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div>\\n \\n \\n <section>\\n \\n <h3> Aim</h3>\\n \\n <p>This study aimed to evaluate the accuracy, readability, and safety of ChatGPT-4.0's responses to frequently asked questions (FAQs) related to orthopaedic trauma and to examine whether readability is associated with the quality and reliability of content.</p>\\n </section>\\n \\n <section>\\n \\n <h3> Methods</h3>\\n \\n <p>Ten common patient questions related to orthopaedic emergencies were submitted to ChatGPT-4.0. Each response was assessed independently by three orthopaedic trauma surgeons using a 4-point ordinal scale for accuracy, clinical appropriateness, and safety. 
Readability was calculated using the Flesch-Kincaid Grade Level (FKGL). Inter-rater agreement was analysed using intraclass correlation coefficients (ICC). The presence of disclaimers was also recorded.</p>\\n </section>\\n \\n <section>\\n \\n <h3> Results</h3>\\n \\n <p>ChatGPT-4.0's responses had a mean FKGL score of 10.5, indicating high school-level readability. Stratified analysis showed comparable readability scores across response quality categories: excellent (10.0), poor (9.8), and dangerous (10.1), suggesting that readability does not predict content reliability. Accuracy and safety scores varied considerably among responses, with the highest inter-rater agreement in clinical appropriateness (ICC = 0.81) and the lowest in safety assessments (ICC = 0.68). Notably, nine out of 10 responses included a disclaimer indicating the nonprofessional nature of the content, with one omission observed in a high-risk clinical scenario.</p>\\n </section>\\n \\n <section>\\n \\n <h3> Conclusion</h3>\\n \\n <p>Although ChatGPT-4.0 provides generally readable responses to orthopaedic trauma questions, readability does not reliably distinguish between accurate and potentially harmful information. These findings highlight the need for expert review when using AI-generated content in clinical communication.</p>\\n </section>\\n </div>\",\"PeriodicalId\":15997,\"journal\":{\"name\":\"Journal of evaluation in clinical practice\",\"volume\":\"31 5\",\"pages\":\"\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2025-08-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of evaluation in clinical practice\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://onlinelibrary.wiley.com/doi/10.1111/jep.70238\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"HEALTH CARE SCIENCES & SERVICES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of evaluation in clinical practice","FirstCategoryId":"3","ListUrlMain":"https://onlinelibrary.wiley.com/doi/10.1111/jep.70238","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
Citations: 0

Abstract

Aim

This study aimed to evaluate the accuracy, readability, and safety of ChatGPT-4.0's responses to frequently asked questions (FAQs) related to orthopaedic trauma and to examine whether readability is associated with the quality and reliability of content.

Methods

Ten common patient questions related to orthopaedic emergencies were submitted to ChatGPT-4.0. Each response was assessed independently by three orthopaedic trauma surgeons using a 4-point ordinal scale for accuracy, clinical appropriateness, and safety. Readability was calculated using the Flesch-Kincaid Grade Level (FKGL). Inter-rater agreement was analysed using intraclass correlation coefficients (ICC). The presence of disclaimers was also recorded.

Results

ChatGPT-4.0's responses had a mean FKGL score of 10.5, indicating high school-level readability. Stratified analysis showed comparable readability scores across response quality categories: excellent (10.0), poor (9.8), and dangerous (10.1), suggesting that readability does not predict content reliability. Accuracy and safety scores varied considerably among responses, with the highest inter-rater agreement in clinical appropriateness (ICC = 0.81) and the lowest in safety assessments (ICC = 0.68). Notably, nine out of 10 responses included a disclaimer indicating the nonprofessional nature of the content, with one omission observed in a high-risk clinical scenario.

Conclusion

Although ChatGPT-4.0 provides generally readable responses to orthopaedic trauma questions, readability does not reliably distinguish between accurate and potentially harmful information. These findings highlight the need for expert review when using AI-generated content in clinical communication.

Source journal
CiteScore: 4.80
Self-citation rate: 4.20%
Articles published: 143
Review time: 3-8 weeks
About the journal: The Journal of Evaluation in Clinical Practice aims to promote the evaluation and development of clinical practice across medicine, nursing and the allied health professions. All aspects of health services research and public health policy analysis and debate are of interest to the Journal whether studied from a population-based or individual patient-centred perspective. Of particular interest to the Journal are submissions on all aspects of clinical effectiveness and efficiency including evidence-based medicine, clinical practice guidelines, clinical decision making, clinical services organisation, implementation and delivery, health economic evaluation, health process and outcome measurement and new or improved methods (conceptual and statistical) for systematic inquiry into clinical practice. Papers may take a classical quantitative or qualitative approach to investigation (or may utilise both techniques) or may take the form of learned essays, structured/systematic reviews and critiques.