Amy Boyle, Bright Huo, Patricia Sylla, Elisa Calabrese, Sunjay Kumar, Bethany J Slater, Danielle S Walsh, R Wesley Vosburg
Large language model-generated clinical practice guideline for appendicitis.
Surgical Endoscopy And Other Interventional Techniques, pp. 3539-3551. DOI: 10.1007/s00464-025-11723-3. Published 2025-06-01 (Epub 2025-04-18). Impact Factor 2.4; JCR Q2 (Surgery).
Citations: 0
Abstract
Background: Clinical practice guidelines provide important evidence-based recommendations to optimize patient care, but their development is labor-intensive and time-consuming. Large language models have shown promise in supporting academic writing and the development of systematic reviews, but their ability to assist with guideline development has not been explored. In this study, we tested the capacity of LLMs to support each stage of guideline development, using the latest SAGES guideline on the surgical management of appendicitis as a comparison.
Methods: Prompts were engineered to trigger LLMs to perform each task of guideline development, using key questions and PICOs derived from the SAGES guideline. ChatGPT-4, Google Gemini, Consensus, and Perplexity were queried on February 21, 2024. LLM performance was evaluated qualitatively, with narrative descriptions of each task's output. The Appraisal of Guidelines for Research and Evaluation in Surgery (AGREE-S) instrument was used to quantitatively assess the quality of the LLM-derived guideline compared to the existing SAGES guideline.
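The workflow described above — framing each guideline-development task as a prompt built from a PICO question — can be sketched roughly as follows. This is a minimal illustration only: the task list, the PICO wording, and the `build_prompt` helper are assumptions, not the authors' actual prompts.

```python
from dataclasses import dataclass

@dataclass
class PICO:
    """One key question framed as Population, Intervention, Comparator, Outcome."""
    population: str
    intervention: str
    comparator: str
    outcome: str

# Illustrative subset of the guideline-development tasks the study prompted LLMs to perform.
TASKS = [
    "generate a literature search syntax",
    "apply the GRADE approach to rate certainty of evidence",
    "draft a recommendation using the Evidence-to-Decision framework",
]

def build_prompt(task: str, pico: PICO) -> str:
    """Assemble a task-specific prompt from one PICO question."""
    return (
        f"You are assisting with a clinical practice guideline. Task: {task}.\n"
        f"Population: {pico.population}\n"
        f"Intervention: {pico.intervention}\n"
        f"Comparator: {pico.comparator}\n"
        f"Outcome: {pico.outcome}"
    )

# Hypothetical PICO wording on the appendicitis guideline topic.
pico = PICO(
    population="adults with uncomplicated acute appendicitis",
    intervention="laparoscopic appendectomy",
    comparator="antibiotic-first management",
    outcome="treatment failure at 1 year",
)
prompts = [build_prompt(task, pico) for task in TASKS]
```

Each generated prompt would then be submitted to each model under evaluation, and the outputs assessed qualitatively per task as described in the Methods.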
Results: Popular LLMs were able to generate a search syntax, perform data analysis, and follow the GRADE approach and Evidence-to-Decision framework to produce guideline recommendations. These LLMs were unable to independently perform a systematic literature search or reliably perform screening, data extraction, or risk of bias assessment at the time of testing. AGREE-S appraisal produced a total score of 119 for the LLM-derived guideline and 156 for the SAGES guideline. In 19 of the 24 domains, the two guidelines scored within two points of each other.
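The domain-level comparison above (19 of 24 AGREE-S domains within two points) amounts to a simple tally over per-domain score pairs. The scores below are hypothetical placeholders, since the abstract reports only the totals (119 vs 156); only the counting logic is shown.

```python
# Hypothetical per-domain AGREE-S scores for illustration only; the study
# compared 24 domains, with totals of 119 (LLM-derived) vs 156 (SAGES).
llm_scores = [4, 5, 3, 6]
sages_scores = [5, 8, 4, 6]

def domains_within(llm, sages, margin=2):
    """Count domains where the two guidelines scored within `margin` points."""
    return sum(abs(a - b) <= margin for a, b in zip(llm, sages))

print(domains_within(llm_scores, sages_scores))  # 3 of these 4 example domains
```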
Conclusions: LLMs demonstrate potential to assist with certain steps of guideline development, which may reduce time and resource burden associated with these tasks. As new models are developed, the role for LLMs in guideline development will continue to evolve. Ongoing research and multidisciplinary collaboration are needed to support the safe and effective integration of LLMs in each step of guideline development.
Journal Introduction:
Uniquely positioned at the interface between various medical and surgical disciplines, Surgical Endoscopy serves as a focal point for the international surgical community to exchange information on practice, theory, and research.
Topics covered in the journal include:
-Surgical aspects of interventional endoscopy, ultrasound, and other techniques in the fields of gastroenterology, obstetrics, gynecology, and urology
-Gastroenterologic surgery
-Thoracic surgery
-Traumatic surgery
-Orthopedic surgery
-Pediatric surgery