Can popular AI large language models provide reliable answers to frequently asked questions about rotator cuff tears?

Q2 Medicine
Ulas Can Kolac MD , Orhan Mete Karademir , Gokhan Ayik MD , Mehmet Kaymakoglu MD , Filippo Familiari MD , Gazi Huri MD
{"title":"Can popular AI large language models provide reliable answers to frequently asked questions about rotator cuff tears?","authors":"Ulas Can Kolac MD ,&nbsp;Orhan Mete Karademir ,&nbsp;Gokhan Ayik MD ,&nbsp;Mehmet Kaymakoglu MD ,&nbsp;Filippo Familiari MD ,&nbsp;Gazi Huri MD","doi":"10.1016/j.jseint.2024.11.012","DOIUrl":null,"url":null,"abstract":"<div><h3>Background</h3><div>Rotator cuff tears are common upper-extremity injuries that significantly impair shoulder function, leading to pain, reduced range of motion, and a decrease in quality of life. With the increasing reliance on artificial intelligence large language models (AI LLMs) for health information, it is crucial to evaluate the quality and readability of the information provided by these models.</div></div><div><h3>Methods</h3><div>A pool of 50 questions was generated related to rotator cuff tear by querying popular AI LLMs (ChatGPT 3.5, ChatGPT 4, Gemini, and Microsoft CoPilot) and using Google search. After that, responses from the AI LLMs were saved and evaluated. For information quality the DISCERN tool and a Likert Scale was used, for readability the Patient Education Materials Assessment Tool for Printable Materials (PEMAT) Understandability Score and the Flesch-Kincaid Reading Ease Score was used. Two orthopedic surgeons assessed the responses, and discrepancies were resolved by a senior author.</div></div><div><h3>Results</h3><div>Out of 198 answers, the median DISCERN score was 40, with 56.6% considered sufficient. The Likert Scale showed 96% sufficiency. The median PEMAT Understandability score was 83.33, with 77.3% sufficiency, while the Flesch-Kincaid Reading Ease score had a median of 42.05 with 88.9% sufficiency. Overall, 39.8% of the answers were sufficient in both information quality and readability. Differences were found among AI models in DISCERN, Likert, PEMAT Understandability, and Flesch-Kincaid scores.</div></div><div><h3>Conclusion</h3><div>AI LLMs generally cannot offer sufficient information quality and readability. While they are not ready for use in medical field, they show a promising future. There is a necessity for continuous re-evaluation of these models due to their rapid evolution. Developing new, comprehensive tools for evaluating medical information quality and readability is crucial for ensuring these models can effectively support patient education. Future research should focus on enhancing readability and consistent information quality to better serve patients.</div></div>","PeriodicalId":34444,"journal":{"name":"JSES International","volume":"9 2","pages":"Pages 390-397"},"PeriodicalIF":0.0000,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JSES International","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2666638324004717","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"Medicine","Score":null,"Total":0}
引用次数: 0

Abstract

Background

Rotator cuff tears are common upper-extremity injuries that significantly impair shoulder function, leading to pain, reduced range of motion, and a decrease in quality of life. With the increasing reliance on artificial intelligence large language models (AI LLMs) for health information, it is crucial to evaluate the quality and readability of the information provided by these models.

Methods

A pool of 50 questions was generated related to rotator cuff tear by querying popular AI LLMs (ChatGPT 3.5, ChatGPT 4, Gemini, and Microsoft CoPilot) and using Google search. After that, responses from the AI LLMs were saved and evaluated. For information quality the DISCERN tool and a Likert Scale was used, for readability the Patient Education Materials Assessment Tool for Printable Materials (PEMAT) Understandability Score and the Flesch-Kincaid Reading Ease Score was used. Two orthopedic surgeons assessed the responses, and discrepancies were resolved by a senior author.

Results

Out of 198 answers, the median DISCERN score was 40, with 56.6% considered sufficient. The Likert Scale showed 96% sufficiency. The median PEMAT Understandability score was 83.33, with 77.3% sufficiency, while the Flesch-Kincaid Reading Ease score had a median of 42.05 with 88.9% sufficiency. Overall, 39.8% of the answers were sufficient in both information quality and readability. Differences were found among AI models in DISCERN, Likert, PEMAT Understandability, and Flesch-Kincaid scores.

Conclusion

AI LLMs generally cannot offer sufficient information quality and readability. While they are not ready for use in medical field, they show a promising future. There is a necessity for continuous re-evaluation of these models due to their rapid evolution. Developing new, comprehensive tools for evaluating medical information quality and readability is crucial for ensuring these models can effectively support patient education. Future research should focus on enhancing readability and consistent information quality to better serve patients.
求助全文
约1分钟内获得全文 求助全文
来源期刊
JSES International
JSES International Medicine-Surgery
CiteScore
2.80
自引率
0.00%
发文量
174
审稿时长
14 weeks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信