Amy Boyle, Bright Huo, Patricia Sylla, Elisa Calabrese, Sunjay Kumar, Bethany J Slater, Danielle S Walsh, R Wesley Vosburg
Large language model-generated clinical practice guideline for appendicitis.
Surgical Endoscopy And Other Interventional Techniques, pp. 3539-3551. DOI: 10.1007/s00464-025-11723-3. Published 2025-06-01 (Epub 2025-04-18). Impact Factor 2.4; JCR Q2 (Surgery).
Citations: 0
Abstract
Background: Clinical practice guidelines provide important evidence-based recommendations to optimize patient care, but their development is labor-intensive and time-consuming. Large language models have shown promise in supporting academic writing and the development of systematic reviews, but their ability to assist with guideline development has not been explored. In this study, we tested the capacity of LLMs to support each stage of guideline development, using the latest SAGES guideline on the surgical management of appendicitis as a comparison.
Methods: Prompts were engineered to trigger LLMs to perform each task of guideline development, using key questions and PICOs derived from the SAGES guideline. ChatGPT-4, Google Gemini, Consensus, and Perplexity were queried on February 21, 2024. LLM performance was evaluated qualitatively, with narrative descriptions of each task's output. The Appraisal of Guidelines for Research and Evaluation in Surgery (AGREE-S) instrument was used to quantitatively assess the quality of the LLM-derived guideline compared to the existing SAGES guideline.
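The workflow described above — framing each guideline-development task as a prompt built from a PICO question — can be sketched roughly as follows. This is a minimal illustration only: the task list, the PICO wording, and the `build_prompt` helper are assumptions, not the authors' actual prompts.

```python
from dataclasses import dataclass

@dataclass
class PICO:
    """One key question framed as Population, Intervention, Comparator, Outcome."""
    population: str
    intervention: str
    comparator: str
    outcome: str

# Illustrative subset of the guideline-development tasks the study prompted LLMs to perform.
TASKS = [
    "generate a literature search syntax",
    "apply the GRADE approach to rate certainty of evidence",
    "draft a recommendation using the Evidence-to-Decision framework",
]

def build_prompt(task: str, pico: PICO) -> str:
    """Assemble a task-specific prompt from one PICO question."""
    return (
        f"You are assisting with a clinical practice guideline. Task: {task}.\n"
        f"Population: {pico.population}\n"
        f"Intervention: {pico.intervention}\n"
        f"Comparator: {pico.comparator}\n"
        f"Outcome: {pico.outcome}"
    )

# Hypothetical PICO wording on the appendicitis guideline topic.
pico = PICO(
    population="adults with uncomplicated acute appendicitis",
    intervention="laparoscopic appendectomy",
    comparator="antibiotic-first management",
    outcome="treatment failure at 1 year",
)
prompts = [build_prompt(task, pico) for task in TASKS]
```

Each generated prompt would then be submitted to each model under evaluation, and the outputs assessed qualitatively per task as described in the Methods.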
Results: Popular LLMs were able to generate a search syntax, perform data analysis, and follow the GRADE approach and Evidence-to-Decision framework to produce guideline recommendations. These LLMs were unable to independently perform a systematic literature search or reliably perform screening, data extraction, or risk of bias assessment at the time of testing. AGREE-S appraisal produced a total score of 119 for the LLM-derived guideline and 156 for the SAGES guideline. In 19 of the 24 domains, the two guidelines scored within two points of each other.
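The domain-level comparison above (19 of 24 AGREE-S domains within two points) amounts to a simple tally over per-domain score pairs. The scores below are hypothetical placeholders, since the abstract reports only the totals (119 vs 156); only the counting logic is shown.

```python
# Hypothetical per-domain AGREE-S scores for illustration only; the study
# compared 24 domains, with totals of 119 (LLM-derived) vs 156 (SAGES).
llm_scores = [4, 5, 3, 6]
sages_scores = [5, 8, 4, 6]

def domains_within(llm, sages, margin=2):
    """Count domains where the two guidelines scored within `margin` points."""
    return sum(abs(a - b) <= margin for a, b in zip(llm, sages))

print(domains_within(llm_scores, sages_scores))  # 3 of these 4 example domains
```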
Conclusions: LLMs demonstrate potential to assist with certain steps of guideline development, which may reduce time and resource burden associated with these tasks. As new models are developed, the role for LLMs in guideline development will continue to evolve. Ongoing research and multidisciplinary collaboration are needed to support the safe and effective integration of LLMs in each step of guideline development.
Journal Introduction:
Uniquely positioned at the interface between various medical and surgical disciplines, Surgical Endoscopy serves as a focal point for the international surgical community to exchange information on practice, theory, and research.
Topics covered in the journal include:
-Surgical aspects of interventional endoscopy, ultrasound, and other techniques in the fields of gastroenterology, obstetrics, gynecology, and urology
-Gastroenterologic surgery
-Thoracic surgery
-Traumatic surgery
-Orthopedic surgery
-Pediatric surgery