Burak Tayyip Dede, Muhammed Oğuz, Bülent Alyanak, Fatih Bağcıer, Mustafa Turgut Yıldızgören
{"title":"梨状肌综合征大语言模型的能力:质量、准确性、完整性和可读性研究。","authors":"Burak Tayyip Dede, Muhammed Oğuz, Bülent Alyanak, Fatih Bağcıer, Mustafa Turgut Yıldızgören","doi":"10.1177/15563316251340697","DOIUrl":null,"url":null,"abstract":"<p><p><i>Background:</i>The proliferation of artificial intelligence has led to widespread patient use of large language models (LLMs). <i>Purpose</i>: We sought to characterize LLM responses to questions about piriformis syndrome (PS). <i>Methods</i>: On August 15, 2024, we asked 3 LLMs-ChatGPT-4, Copilot, and Gemini-to respond to the 25 most frequently asked questions about PS, as tracked by Google Trends. We evaluated the accuracy and completeness of the responses according to the Likert scale. We used the Ensuring Quality Information for Patients (EQIP) tool to assess the quality of the responses and assessed readability using Flesch-Kincaid Reading Ease (FKRE) and Flesch-Kincaid Grade Level (FKGL) scores. <i>Results</i>: The mean completeness scores of the responses obtained from ChatGPT, Copilot, and Gemini were 2.8 ± 0.3, 2.2 ± 0.6, and 2.6 ± 0.4, respectively. There was a significant difference in the mean completeness score among LLMs. In pairwise comparisons, ChatGPT and Gemini were superior to Copilot. There was no significant difference between the LLMs in terms of mean accuracy scores. In readability analyses, no significant difference was found in terms of FKRE scores. However, a significant difference was found in FKGL scores. A significant difference between LLMs was identified in the quality analysis performed according to EQIP scores. 
<i>Conclusion</i>: Although the use of LLMs in healthcare is promising, our findings suggest that these technologies need to be improved to perform better in terms of accuracy, completeness, quality, and readability on PS for a general audience.</p>","PeriodicalId":35357,"journal":{"name":"Hss Journal","volume":" ","pages":"15563316251340697"},"PeriodicalIF":1.3000,"publicationDate":"2025-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12092406/pdf/","citationCount":"0","resultStr":"{\"title\":\"Competencies of Large Language Models About Piriformis Syndrome: Quality, Accuracy, Completeness, and Readability Study.\",\"authors\":\"Burak Tayyip Dede, Muhammed Oğuz, Bülent Alyanak, Fatih Bağcıer, Mustafa Turgut Yıldızgören\",\"doi\":\"10.1177/15563316251340697\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><p><i>Background:</i>The proliferation of artificial intelligence has led to widespread patient use of large language models (LLMs). <i>Purpose</i>: We sought to characterize LLM responses to questions about piriformis syndrome (PS). <i>Methods</i>: On August 15, 2024, we asked 3 LLMs-ChatGPT-4, Copilot, and Gemini-to respond to the 25 most frequently asked questions about PS, as tracked by Google Trends. We evaluated the accuracy and completeness of the responses according to the Likert scale. We used the Ensuring Quality Information for Patients (EQIP) tool to assess the quality of the responses and assessed readability using Flesch-Kincaid Reading Ease (FKRE) and Flesch-Kincaid Grade Level (FKGL) scores. <i>Results</i>: The mean completeness scores of the responses obtained from ChatGPT, Copilot, and Gemini were 2.8 ± 0.3, 2.2 ± 0.6, and 2.6 ± 0.4, respectively. There was a significant difference in the mean completeness score among LLMs. In pairwise comparisons, ChatGPT and Gemini were superior to Copilot. 
There was no significant difference between the LLMs in terms of mean accuracy scores. In readability analyses, no significant difference was found in terms of FKRE scores. However, a significant difference was found in FKGL scores. A significant difference between LLMs was identified in the quality analysis performed according to EQIP scores. <i>Conclusion</i>: Although the use of LLMs in healthcare is promising, our findings suggest that these technologies need to be improved to perform better in terms of accuracy, completeness, quality, and readability on PS for a general audience.</p>\",\"PeriodicalId\":35357,\"journal\":{\"name\":\"Hss Journal\",\"volume\":\" \",\"pages\":\"15563316251340697\"},\"PeriodicalIF\":1.3000,\"publicationDate\":\"2025-05-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12092406/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Hss Journal\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1177/15563316251340697\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"ORTHOPEDICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Hss Journal","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1177/15563316251340697","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
Competencies of Large Language Models About Piriformis Syndrome: Quality, Accuracy, Completeness, and Readability Study.
Background: The proliferation of artificial intelligence has led to widespread patient use of large language models (LLMs).

Purpose: We sought to characterize LLM responses to questions about piriformis syndrome (PS).

Methods: On August 15, 2024, we asked 3 LLMs (ChatGPT-4, Copilot, and Gemini) to respond to the 25 most frequently asked questions about PS, as tracked by Google Trends. We evaluated the accuracy and completeness of the responses using Likert scales. We used the Ensuring Quality Information for Patients (EQIP) tool to assess the quality of the responses, and we assessed readability using Flesch-Kincaid Reading Ease (FKRE) and Flesch-Kincaid Grade Level (FKGL) scores.

Results: The mean completeness scores of the responses from ChatGPT, Copilot, and Gemini were 2.8 ± 0.3, 2.2 ± 0.6, and 2.6 ± 0.4, respectively, a significant difference; in pairwise comparisons, ChatGPT and Gemini were superior to Copilot. Mean accuracy scores did not differ significantly among the LLMs. In the readability analyses, FKRE scores did not differ significantly, but FKGL scores did. The quality analysis based on EQIP scores also identified a significant difference among the LLMs.

Conclusion: Although the use of LLMs in healthcare is promising, our findings suggest that these technologies need to improve in accuracy, completeness, quality, and readability on PS for a general audience.
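The FKRE and FKGL metrics used in the Methods are standard published formulas over three text counts: total words, sentences, and syllables. As an illustrative sketch (not the authors' analysis code), the two scores can be computed from those counts as follows; the function name and interface are assumptions for this example.

```python
def flesch_scores(words: int, sentences: int, syllables: int) -> tuple[float, float]:
    """Return (FKRE, FKGL) from raw text counts using the standard
    Flesch-Kincaid formulas. Higher FKRE = easier to read; FKGL
    approximates the U.S. school grade level needed to understand the text."""
    words_per_sentence = words / sentences
    syllables_per_word = syllables / words
    # Flesch-Kincaid Reading Ease
    fkre = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    # Flesch-Kincaid Grade Level
    fkgl = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return fkre, fkgl


# Example: a 100-word passage in 5 sentences with 150 syllables
fkre, fkgl = flesch_scores(words=100, sentences=5, syllables=150)
print(f"FKRE = {fkre:.2f}, FKGL = {fkgl:.2f}")  # FKRE = 59.64, FKGL = 9.91
```

In practice, word, sentence, and syllable counts come from a tokenizer or readability library; syllable counting in particular is heuristic and varies slightly between tools, which can shift scores by a point or two.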
Journal introduction:
The HSS Journal is the musculoskeletal journal of the Hospital for Special Surgery. Its aim is to promote cutting-edge research, clinical pathways, and state-of-the-art techniques that inform and facilitate the continuing education of the orthopedic and musculoskeletal communities. The HSS Journal publishes articles that advance knowledge of musculoskeletal diseases and encourages submission of manuscripts from all musculoskeletal disciplines.