Comparative evaluation of large language models in delivering guideline-compliant recommendations for topical NSAID use in musculoskeletal pain: a multidimensional analysis.

IF 2.8 | Zone 3, Medicine | JCR Q2, Rheumatology
Chengqi Dong, Xu Qiu, Jiayi Deng, Li Xu, Xiaoxue Dong, Shi Chen, Tao Mei, Qinghua Li, Yuan Cheng, Jianliang Sun, Hanbin Wang, Liang Yu
Clinical Rheumatology. Journal Article. Published 2025-09-15. DOI: 10.1007/s10067-025-07640-4 (https://doi.org/10.1007/s10067-025-07640-4)

Abstract

Introduction: While large language models (LLMs) are increasingly used in clinical decision support, their adherence to evidence-based guidelines, particularly for musculoskeletal pain management, remains understudied.

Methods: The responses of four LLMs (DeepSeek-R1, ChatGPT-4o, Gemini, Grok-3) on topical NSAID use for musculoskeletal pain were evaluated along three dimensions: response quality (accuracy, over-conclusiveness, supplementary information, and incompleteness), standardized readability metrics (Flesch Reading Ease, Flesch-Kincaid Grade Level), and actionability, quantified with the PEMAT-P tool.

Results: The four LLMs differed significantly in accuracy (ANOVA p = 0.045), with Gemini scoring highest (8.33 ± 0.77) and DeepSeek-R1 lowest (7.72 ± 1.52), and in over-conclusiveness (ANOVA p = 0.025), with Grok-3 scoring lowest (4.56 ± 1.42) and ChatGPT-4o highest (6.72 ± 1.49). ChatGPT-4o provided the most supplementary content (6.94 ± 2.29, p = 0.106) and DeepSeek-R1 had the highest incompleteness (5.00 ± 2.52, p = 0.261); neither difference was statistically significant. All models exceeded recommended readability thresholds (9th-10th grade level), and none met the actionability standard (scores ≤ 33.5%).

Conclusions: LLMs demonstrate potential as clinical aids. The comprehensive performance of Gemini and Grok is relatively favorable, yet their readability and actionability remain unsatisfactory. Future development should integrate clinician feedback and real-world validation to ensure safety. Human oversight and targeted AI training are critical for safe implementation.

Key Points
• The study reveals significant differences in accuracy among LLMs, highlighting inconsistencies in clinical decision support.
• While all models generated readable text, the complexity remained high, potentially limiting accessibility for some patients.
• Glucocorticoid use for patients in remission was more strongly associated with impaired physical function in patients aged 75-84 than in patients aged 55-74 years.
• Over-conclusiveness and incomplete adherence to evidence-based guidelines underscore the necessity for human oversight and targeted AI training in clinical applications.
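
For readers unfamiliar with the two readability metrics named in the Methods, the following is a minimal sketch of the standard published Flesch Reading Ease and Flesch-Kincaid Grade Level formulas. It is not the authors' scoring pipeline, which the abstract does not describe. The function names and the vowel-group syllable counter are illustrative assumptions; the syllable counter in particular is a rough heuristic, and dedicated readability tools count syllables more carefully.

```python
import re


def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch Reading Ease formula; higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Standard Flesch-Kincaid Grade Level formula (approximate US school grade)."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def count_syllables(word: str) -> int:
    """Heuristic (assumption): one syllable per run of consecutive vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def text_counts(text: str) -> tuple[int, int, int]:
    """Return (words, sentences, syllables) using simple regex tokenization."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), len(sentences), syllables


if __name__ == "__main__":
    # Hypothetical sample sentence, not taken from the study's model outputs.
    sample = "Apply the gel. Wash your hands."
    w, s, syl = text_counts(sample)
    print(round(flesch_reading_ease(w, s, syl), 2))
    print(round(flesch_kincaid_grade(w, s, syl), 2))
```

Both formulas depend only on average sentence length and average syllables per word, which is why long sentences and polysyllabic clinical terms push model output above the recommended grade level.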

Source journal: Clinical Rheumatology (Medicine, Rheumatology)
CiteScore: 6.90
Self-citation rate: 2.90%
Articles published: 441
Review time: 3 months
Journal description: Clinical Rheumatology is an international English-language journal devoted to publishing original clinical investigation and research in the general field of rheumatology, with an emphasis on clinical aspects at postgraduate level. The journal succeeds Acta Rheumatologica Belgica, originally founded in 1945 as the official journal of the Belgian Rheumatology Society. Clinical Rheumatology aims to cover all modern trends in clinical and experimental research as well as the management and evaluation of diagnostic and treatment procedures connected with the inflammatory, immunologic, metabolic, genetic and degenerative soft and hard connective tissue diseases.