Currently Available Large Language Models Do Not Provide Musculoskeletal Treatment Recommendations That Are Concordant With Evidence-Based Clinical Practice Guidelines.

Impact Factor 4.4 · CAS Zone 1 (Medicine) · JCR Q1 ORTHOPEDICS
Benedict U Nwachukwu, Nathan H Varady, Answorth A Allen, Joshua S Dines, David W Altchek, Riley J Williams, Kyle N Kunze
Arthroscopy: The Journal of Arthroscopic and Related Surgery · Published 2024-08-22 · DOI: 10.1016/j.arthro.2024.07.040
Citations: 0

Abstract

Purpose: To determine whether several leading, commercially available large language models (LLMs) provide treatment recommendations concordant with evidence-based clinical practice guidelines (CPGs) developed by the American Academy of Orthopaedic Surgeons (AAOS).

Methods: All CPGs concerning the management of rotator cuff tears (n = 33) and anterior cruciate ligament injuries (n = 15) were extracted from the AAOS. Treatment recommendations from Chat-Generative Pretrained Transformer version 4 (ChatGPT-4; OpenAI), Gemini (Google), Mistral-7B (Mistral AI), and Claude-3 (Anthropic) were graded by 2 blinded physicians as being concordant, discordant, or indeterminate (i.e., neutral response without definitive recommendation) with respect to AAOS CPGs. The overall concordance between LLM and AAOS recommendations was quantified, and the comparative overall concordance of recommendations among the 4 LLMs was evaluated through the Fisher exact test.
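The study's statistical comparison was a Fisher exact test across all 4 LLMs (an R×C table). As a minimal illustration of the underlying technique, the sketch below implements the standard two-sided Fisher exact test for a simplified 2×2 pairwise comparison, using the concordant/non-concordant counts the abstract reports for the best- and worst-performing models. This is an assumed simplification for illustration, not the study's actual analysis.

```python
from math import comb

def fisher_exact_2x2(a, b, c, d):
    """Two-sided Fisher exact test for a 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins whose probability does not exceed that of the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def p_table(x):
        # Hypergeometric probability of x row-1 entries in column 1.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    p_obs = p_table(a)
    lo = max(0, col1 - row2)
    hi = min(col1, row1)
    # Small tolerance guards against floating-point ties.
    return sum(p_table(x) for x in range(lo, hi + 1)
               if p_table(x) <= p_obs * (1 + 1e-9))

# Illustrative pairwise comparison from the abstract's counts:
# ChatGPT-4: 38/48 concordant vs. Mistral-7B: 28/48 concordant.
p = fisher_exact_2x2(38, 10, 28, 20)
print(f"p = {p:.4f}")
```

A pairwise 2×2 test like this would require a multiplicity correction if used for all model pairs; the study instead tested all 4 models jointly.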

Results: Overall, 135 responses (70.3%) were concordant, 43 (22.4%) were indeterminate, and 14 (7.3%) were discordant. Inter-rater reliability for concordance classification was excellent (κ = 0.92). Concordance with AAOS CPGs was most frequently observed with ChatGPT-4 (n = 38, 79.2%) and least frequently observed with Mistral-7B (n = 28, 58.3%). Indeterminate recommendations were most frequently observed with Mistral-7B (n = 17, 35.4%) and least frequently observed with Claude-3 (n = 8, 16.7%). Discordant recommendations were most frequently observed with Gemini (n = 6, 12.5%) and least frequently observed with ChatGPT-4 (n = 1, 2.1%). Overall, no statistically significant difference in concordant recommendations was observed across LLMs (P = .12). Of all recommendations, only 20 (10.4%) were transparent and provided references with full bibliographic details or links to specific peer-reviewed content to support recommendations.
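The inter-rater reliability of κ = 0.92 refers to chance-corrected agreement between the 2 blinded graders. A minimal sketch of Cohen's kappa for two raters is shown below; the example gradings are hypothetical, not the study's data.

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    n = len(rater1)
    # Observed proportion of exact agreement.
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Expected agreement if each rater labeled independently
    # according to their own marginal label frequencies.
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[k] * c2[k] for k in c1) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical gradings (C=concordant, D=discordant, I=indeterminate):
r1 = ["C", "C", "I", "D", "C", "I"]
r2 = ["C", "C", "I", "D", "C", "C"]
print(f"kappa = {cohens_kappa(r1, r2):.2f}")  # prints "kappa = 0.71"
```

Kappa exceeding roughly 0.8 is conventionally read as excellent agreement, consistent with the κ = 0.92 reported here.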

Conclusions: Among leading commercially available LLMs, more than 1-in-4 recommendations concerning the evaluation and management of rotator cuff and anterior cruciate ligament injuries do not reflect current evidence-based CPGs. Although ChatGPT-4 showed the highest performance, clinically significant rates of recommendations without concordance or supporting evidence were observed. Only 10% of responses by LLMs were transparent, precluding users from fully interpreting the sources from which recommendations were provided.

Clinical relevance: Although leading LLMs generally provide recommendations concordant with CPGs, a substantial error rate exists, and the proportion of recommendations that do not align with these CPGs suggests that LLMs are not trustworthy clinical support tools at this time. Each off-the-shelf, closed-source LLM has strengths and weaknesses. Future research should evaluate and compare multiple LLMs to avoid bias associated with narrow evaluation of few models as observed in the current literature.

Source journal metrics: CiteScore 9.30 · Self-citation rate 17.00% · Articles per year 555 · Review time 58 days
Journal description: Nowhere is minimally invasive surgery explained better than in Arthroscopy, the leading peer-reviewed journal in the field. Every issue enables you to put into perspective the usefulness of the various emerging arthroscopic techniques. The advantages and disadvantages of these methods, along with their applications in various situations, are discussed in relation to their efficiency, efficacy, and cost benefit. As a special incentive, paid subscribers also receive access to the journal's expanded website.