Can popular AI large language models provide reliable answers to frequently asked questions about rotator cuff tears?

Q2 Medicine

JSES International. Pub Date: 2025-03-01. DOI: 10.1016/j.jseint.2024.11.012

Ulas Can Kolac MD, Orhan Mete Karademir, Gokhan Ayik MD, Mehmet Kaymakoglu MD, Filippo Familiari MD, Gazi Huri MD

JSES International, Volume 9, Issue 2, Pages 390-397. Journal Article. https://www.sciencedirect.com/science/article/pii/S2666638324004717

Citations: 0

Abstract

Background

Rotator cuff tears are common upper-extremity injuries that significantly impair shoulder function, leading to pain, reduced range of motion, and a decrease in quality of life. With the increasing reliance on artificial intelligence large language models (AI LLMs) for health information, it is crucial to evaluate the quality and readability of the information provided by these models.

Methods

A pool of 50 questions related to rotator cuff tears was generated by querying popular AI LLMs (ChatGPT 3.5, ChatGPT 4, Gemini, and Microsoft CoPilot) and by using Google search. The responses from the AI LLMs were then saved and evaluated. Information quality was assessed with the DISCERN tool and a Likert scale; readability was assessed with the Patient Education Materials Assessment Tool for Printable Materials (PEMAT) Understandability Score and the Flesch-Kincaid Reading Ease Score. Two orthopedic surgeons assessed the responses, and discrepancies were resolved by a senior author.
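
As an illustration of the readability measure named above, the following is a minimal sketch of a reading-ease calculation. It assumes the abstract refers to the standard Flesch Reading Ease formula, 206.835 - 1.015 × (words / sentences) - 84.6 × (syllables / words). The abstract does not describe the authors' actual tooling, so the function names and the vowel-group syllable heuristic here are illustrative assumptions; dedicated readability tools count syllables more carefully and may give different scores.

```python
import re


def count_syllables(word: str) -> int:
    """Estimate syllables by counting vowel groups (a rough heuristic, not a dictionary lookup)."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Discount a trailing silent 'e' (as in "ease"), but not endings like "-le" or "-ee".
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Compute the Flesch Reading Ease score; higher values mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    simple = "The cat sat on the mat."
    clinical = "Rotator cuff tears significantly impair shoulder function."
    print(round(flesch_reading_ease(simple), 2))
    print(round(flesch_reading_ease(clinical), 2))
```

Short sentences built from one-syllable words score high, while dense clinical phrasing scores much lower, which is the direction of effect the readability assessment relies on.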

Results

Out of 198 answers, the median DISCERN score was 40, with 56.6% considered sufficient. The Likert Scale showed 96% sufficiency. The median PEMAT Understandability score was 83.33, with 77.3% sufficiency, while the Flesch-Kincaid Reading Ease score had a median of 42.05 with 88.9% sufficiency. Overall, 39.8% of the answers were sufficient in both information quality and readability. Differences were found among AI models in DISCERN, Likert, PEMAT Understandability, and Flesch-Kincaid scores.

Conclusion

AI LLMs generally cannot offer sufficient information quality and readability. While they are not ready for use in the medical field, they show a promising future. Because these models evolve rapidly, they need continuous re-evaluation. Developing new, comprehensive tools for evaluating medical information quality and readability is crucial for ensuring these models can effectively support patient education. Future research should focus on improving readability and the consistency of information quality to better serve patients.


