Audrey Y Su, Ashley Knebel, Andrew Y Xu, Marco Kaper, Phillip Schmitt, Joseph E Nassar, Manjot Singh, Michael J Farias, Jinho Kim, Bassel G Diebo, Alan H Daniels
Evaluation of retrieval-augmented generation and large language models in clinical guidelines for degenerative spine conditions.
European Spine Journal, journal article, published 2025-07-07. DOI: 10.1007/s00586-025-08994-8 (https://doi.org/10.1007/s00586-025-08994-8)
Abstract
Purpose: Degenerative spinal diseases often require complex, patient-specific treatment, presenting a compelling challenge for artificial intelligence (AI) integration into clinical practice. While existing literature has focused on ChatGPT-4o performance in individual spine conditions, this study compares ChatGPT-4o, a traditional large language model (LLM), against NotebookLM, a novel retrieval-augmented model (RAG-LLM) supplemented with North American Spine Society (NASS) guidelines, for concordance with all five published NASS guidelines for degenerative spinal diseases.
Methods: A total of 118 questions from NASS guidelines regarding five degenerative spinal conditions were presented to ChatGPT-4o and NotebookLM. All responses were scored on four criteria: accuracy, evidence-based conclusions, supplementary information, and completeness of information.
Results: Overall, NotebookLM provided significantly more accurate responses (98.3% vs. 40.7%, p < 0.05), more evidence-based conclusions (99.1% vs. 40.7%, p < 0.05), and more complete information (94.1% vs. 79.7%, p < 0.05), while ChatGPT-4o provided more supplementary information (98.3% vs. 67.8%, p < 0.05). These discrepancies became most prominent in nonsurgical and surgical interventions, wherein ChatGPT often produced recommendations with unsubstantiated certainty.
Conclusion: While RAG-LLMs are a promising tool for clinical decision-making assistance and show significant improvement from prior models, physicians should remain cautious when integrating AI into patient care, especially in the context of nuanced medical scenarios.
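As a reader's sanity check on the Results, the reported percentages can be converted back into whole-number response counts. The sketch below is not the authors' analysis; it assumes every percentage is a share of the same 118 questions, and the names `REPORTED`, `implied_count`, and `implied_counts` are hypothetical.

```python
# Assumption: every percentage in the Results is a share of the same 118 questions.
TOTAL_QUESTIONS = 118

# (NotebookLM %, ChatGPT-4o %) for each scoring criterion, as reported in the abstract.
REPORTED = {
    "accurate": (98.3, 40.7),
    "evidence_based": (99.1, 40.7),
    "complete": (94.1, 79.7),
    "supplementary": (67.8, 98.3),
}


def implied_count(percent: float, total: int = TOTAL_QUESTIONS) -> int:
    """Return the whole number of responses closest to `percent` of `total`."""
    return round(percent * total / 100)


def implied_counts() -> dict:
    """Map each criterion to (NotebookLM count, ChatGPT-4o count)."""
    return {
        name: (implied_count(notebooklm), implied_count(chatgpt))
        for name, (notebooklm, chatgpt) in REPORTED.items()
    }


if __name__ == "__main__":
    for name, (notebooklm, chatgpt) in implied_counts().items():
        print(
            f"{name}: NotebookLM {notebooklm}/{TOTAL_QUESTIONS}, "
            f"ChatGPT-4o {chatgpt}/{TOTAL_QUESTIONS}"
        )
```

Under that assumption, the accuracy figures correspond to 116 versus 48 of the 118 responses. The counts are inferred from rounded percentages, so they are an approximation to read alongside the abstract, not reported data.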
About the journal:
"European Spine Journal" is a publication founded in response to the increasing trend toward specialization in spinal surgery and spinal pathology in general. The Journal is devoted to all spine-related disciplines, including functional and surgical anatomy of the spine, biomechanics and pathophysiology, diagnostic procedures, and neurology, surgery and outcomes. The aim of "European Spine Journal" is to support the further development of highly innovative spine treatments, including but not restricted to surgery, and to provide an integrated and balanced view of diagnostic, research and treatment procedures as well as outcomes that will enhance effective collaboration among specialists worldwide. The "European Spine Journal" also participates in education by means of videos, interactive meetings and the endorsement of educational efforts.
Official publication of EUROSPINE, The Spine Society of Europe