Competencies of Large Language Models About Piriformis Syndrome: Quality, Accuracy, Completeness, and Readability Study.

Impact Factor 1.3 · CAS Region 4 (Medicine) · JCR Q3 (Orthopedics)
Burak Tayyip Dede, Muhammed Oğuz, Bülent Alyanak, Fatih Bağcıer, Mustafa Turgut Yıldızgören
Journal: HSS Journal · DOI: 10.1177/15563316251340697 · Published: 2025-05-20
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12092406/pdf/
Citations: 0

Abstract

Background: The proliferation of artificial intelligence has led to widespread patient use of large language models (LLMs). Purpose: We sought to characterize LLM responses to questions about piriformis syndrome (PS). Methods: On August 15, 2024, we asked 3 LLMs (ChatGPT-4, Copilot, and Gemini) to respond to the 25 most frequently asked questions about PS, as tracked by Google Trends. We evaluated the accuracy and completeness of the responses on Likert scales. We used the Ensuring Quality Information for Patients (EQIP) tool to assess the quality of the responses and assessed readability using Flesch-Kincaid Reading Ease (FKRE) and Flesch-Kincaid Grade Level (FKGL) scores. Results: The mean completeness scores of the responses from ChatGPT, Copilot, and Gemini were 2.8 ± 0.3, 2.2 ± 0.6, and 2.6 ± 0.4, respectively; this difference among LLMs was significant, and in pairwise comparisons ChatGPT and Gemini were superior to Copilot. Mean accuracy scores did not differ significantly between the LLMs. In readability analyses, no significant difference was found in FKRE scores, but a significant difference was found in FKGL scores. Quality analysis by EQIP score also identified a significant difference between LLMs. Conclusion: Although the use of LLMs in healthcare is promising, our findings suggest that these technologies need improvement to perform better in accuracy, completeness, quality, and readability on PS for a general audience.
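The FKRE and FKGL readability metrics used in the Methods have standard closed-form definitions based on words per sentence and syllables per word. The sketch below shows those formulas; the vowel-group syllable counter is a naive heuristic assumed for illustration (the study's actual scoring tool is not specified, and published tools typically use dictionary-based syllable counts).

```python
import re

def count_syllables(word: str) -> int:
    """Naive heuristic: count vowel groups, dropping a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_kincaid(text: str) -> tuple[float, float]:
    """Return (FKRE, FKGL) using the standard Flesch-Kincaid formulas."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # mean words per sentence
    spw = syllables / len(words)        # mean syllables per word
    fkre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fkre, fkgl
```

Higher FKRE means easier text (90+ is readable by a typical 11-year-old), while FKGL maps directly to a US school grade level; the two move in opposite directions, which is why the study reports them separately.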

Source journal: HSS Journal (Medicine: Surgery)
CiteScore: 3.90 · Self-citation rate: 0.00% · Articles per year: 42
About the journal: The HSS Journal is the musculoskeletal journal of Hospital for Special Surgery. Its aim is to promote cutting-edge research, clinical pathways, and state-of-the-art techniques that inform and facilitate the continuing education of the orthopaedic and musculoskeletal communities. The journal publishes articles that advance knowledge of musculoskeletal diseases and encourages submissions from all musculoskeletal disciplines.