Burak Tayyip Dede, Muhammed Oğuz, Bülent Alyanak, Fatih Bağcıer, Mustafa Turgut Yıldızgören
{"title":"梨状肌综合征大语言模型的能力:质量、准确性、完整性和可读性研究。","authors":"Burak Tayyip Dede, Muhammed Oğuz, Bülent Alyanak, Fatih Bağcıer, Mustafa Turgut Yıldızgören","doi":"10.1177/15563316251340697","DOIUrl":null,"url":null,"abstract":"<p><p><i>Background:</i>The proliferation of artificial intelligence has led to widespread patient use of large language models (LLMs). <i>Purpose</i>: We sought to characterize LLM responses to questions about piriformis syndrome (PS). <i>Methods</i>: On August 15, 2024, we asked 3 LLMs-ChatGPT-4, Copilot, and Gemini-to respond to the 25 most frequently asked questions about PS, as tracked by Google Trends. We evaluated the accuracy and completeness of the responses according to the Likert scale. We used the Ensuring Quality Information for Patients (EQIP) tool to assess the quality of the responses and assessed readability using Flesch-Kincaid Reading Ease (FKRE) and Flesch-Kincaid Grade Level (FKGL) scores. <i>Results</i>: The mean completeness scores of the responses obtained from ChatGPT, Copilot, and Gemini were 2.8 ± 0.3, 2.2 ± 0.6, and 2.6 ± 0.4, respectively. There was a significant difference in the mean completeness score among LLMs. In pairwise comparisons, ChatGPT and Gemini were superior to Copilot. There was no significant difference between the LLMs in terms of mean accuracy scores. In readability analyses, no significant difference was found in terms of FKRE scores. However, a significant difference was found in FKGL scores. A significant difference between LLMs was identified in the quality analysis performed according to EQIP scores. 
<i>Conclusion</i>: Although the use of LLMs in healthcare is promising, our findings suggest that these technologies need to be improved to perform better in terms of accuracy, completeness, quality, and readability on PS for a general audience.</p>","PeriodicalId":35357,"journal":{"name":"Hss Journal","volume":" ","pages":"15563316251340697"},"PeriodicalIF":1.3000,"publicationDate":"2025-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12092406/pdf/","citationCount":"0","resultStr":"{\"title\":\"Competencies of Large Language Models About Piriformis Syndrome: Quality, Accuracy, Completeness, and Readability Study.\",\"authors\":\"Burak Tayyip Dede, Muhammed Oğuz, Bülent Alyanak, Fatih Bağcıer, Mustafa Turgut Yıldızgören\",\"doi\":\"10.1177/15563316251340697\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p><i>Background:</i>The proliferation of artificial intelligence has led to widespread patient use of large language models (LLMs). <i>Purpose</i>: We sought to characterize LLM responses to questions about piriformis syndrome (PS). <i>Methods</i>: On August 15, 2024, we asked 3 LLMs-ChatGPT-4, Copilot, and Gemini-to respond to the 25 most frequently asked questions about PS, as tracked by Google Trends. We evaluated the accuracy and completeness of the responses according to the Likert scale. We used the Ensuring Quality Information for Patients (EQIP) tool to assess the quality of the responses and assessed readability using Flesch-Kincaid Reading Ease (FKRE) and Flesch-Kincaid Grade Level (FKGL) scores. <i>Results</i>: The mean completeness scores of the responses obtained from ChatGPT, Copilot, and Gemini were 2.8 ± 0.3, 2.2 ± 0.6, and 2.6 ± 0.4, respectively. There was a significant difference in the mean completeness score among LLMs. In pairwise comparisons, ChatGPT and Gemini were superior to Copilot. 
There was no significant difference between the LLMs in terms of mean accuracy scores. In readability analyses, no significant difference was found in terms of FKRE scores. However, a significant difference was found in FKGL scores. A significant difference between LLMs was identified in the quality analysis performed according to EQIP scores. <i>Conclusion</i>: Although the use of LLMs in healthcare is promising, our findings suggest that these technologies need to be improved to perform better in terms of accuracy, completeness, quality, and readability on PS for a general audience.</p>\",\"PeriodicalId\":35357,\"journal\":{\"name\":\"Hss Journal\",\"volume\":\" \",\"pages\":\"15563316251340697\"},\"PeriodicalIF\":1.3000,\"publicationDate\":\"2025-05-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12092406/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Hss Journal\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1177/15563316251340697\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"ORTHOPEDICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Hss Journal","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1177/15563316251340697","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
Competencies of Large Language Models About Piriformis Syndrome: Quality, Accuracy, Completeness, and Readability Study.
Background: The proliferation of artificial intelligence has led to widespread patient use of large language models (LLMs).

Purpose: We sought to characterize LLM responses to questions about piriformis syndrome (PS).

Methods: On August 15, 2024, we asked 3 LLMs (ChatGPT-4, Copilot, and Gemini) to respond to the 25 most frequently asked questions about PS, as tracked by Google Trends. We evaluated the accuracy and completeness of the responses using Likert scales. We used the Ensuring Quality Information for Patients (EQIP) tool to assess the quality of the responses, and we assessed readability using Flesch-Kincaid Reading Ease (FKRE) and Flesch-Kincaid Grade Level (FKGL) scores.

Results: The mean completeness scores of the responses from ChatGPT, Copilot, and Gemini were 2.8 ± 0.3, 2.2 ± 0.6, and 2.6 ± 0.4, respectively, a significant difference; in pairwise comparisons, ChatGPT and Gemini were superior to Copilot. Mean accuracy scores did not differ significantly among the LLMs. In the readability analyses, FKRE scores did not differ significantly, but FKGL scores did. The quality analysis based on EQIP scores also identified a significant difference among the LLMs.

Conclusion: Although the use of LLMs in healthcare is promising, our findings suggest that these technologies need to improve in accuracy, completeness, quality, and readability on PS for a general audience.
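The FKRE and FKGL metrics used in the Methods are standard published formulas over three text counts: total words, sentences, and syllables. As an illustrative sketch (not the authors' analysis code), the two scores can be computed from those counts as follows; the function name and interface are assumptions for this example.

```python
def flesch_scores(words: int, sentences: int, syllables: int) -> tuple[float, float]:
    """Return (FKRE, FKGL) from raw text counts using the standard
    Flesch-Kincaid formulas. Higher FKRE = easier to read; FKGL
    approximates the U.S. school grade level needed to understand the text."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    # Flesch-Kincaid Reading Ease
    fkre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Flesch-Kincaid Grade Level
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fkre, fkgl


# Example: a 100-word passage in 5 sentences with 150 syllables
fkre, fkgl = flesch_scores(words=100, sentences=5, syllables=150)
print(f"FKRE = {fkre:.2f}, FKGL = {fkgl:.2f}")  # FKRE = 59.64, FKGL = 9.91
```

In practice, word, sentence, and syllable counts come from a tokenizer or readability library; syllable counting in particular is heuristic and varies slightly between tools, which can shift scores by a point or two.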
Journal introduction:
The HSS Journal is the musculoskeletal journal of the Hospital for Special Surgery. Its aim is to promote cutting-edge research, clinical pathways, and state-of-the-art techniques that inform and facilitate the continuing education of the orthopedic and musculoskeletal communities. The HSS Journal publishes articles that advance knowledge of musculoskeletal diseases and encourages submission of manuscripts from all musculoskeletal disciplines.